From: Roland S. <rsc...@hi...> - 2005-07-19 18:01:48
Attachments:
r200_crossbar.diff
Here's a somewhat experimental patch to enable GL_ARB_texture_env_crossbar on r200. It got uglier than I wanted... Works with tests/crossbar and glean (texcombine); I couldn't find anything else which uses it (well, ut2k4 seems to, but I couldn't see any difference).

There is unfortunately some overhead associated with it (figuring out which register to use for the output of each stage). I hope it's not too serious (it might roughly double the amount of CPU cycles spent on that tex env setup). Still, if you have all 6 texture units enabled and reference textures back and forth like mad, you're somewhat likely to hit a raster fallback I guess :-( (since you can have 3 arguments per environment, both for rgb and alpha, in the worst case you will need to reference all 6 registers in a single env stage). Just one register more and there would be no problem (as you can reference arbitrary texture sampling results, but only the previous tex env result)...

I'm actually wondering how ATI solved that problem in their driver; I couldn't see an easy way to avoid the fallback - even using the 2 additional tex env stages or the second phase of the fragment pipeline isn't going to fix the issue, I think. Maybe someone else has a good idea?

Roland
From: Philipp K. K. <pk...@sp...> - 2005-07-19 19:06:03
Roland Scheidegger wrote:
> Here's a somewhat experimental patch to enable
> GL_ARB_texture_env_crossbar on r200. It got more ugly than I wanted...
> Works with tests/crossbar, glean(texcombine), couldn't find anything
> more which uses it (well ut2k4 seems to, but I couldn't see any
> difference).

There's glest, a free strategy game, which uses GL_ARB_texture_env_crossbar.

Philipp
From: Ian R. <id...@us...> - 2005-07-22 06:33:48
Roland Scheidegger wrote:
> I'm actually wondering how ATI solved that problem in their driver, I
> couldn't see an easy way out to avoid the fallback - even using the 2
> additional tex env stages or the second phase of the fragment pipeline
> isn't going to fix the issue I think. Maybe someone else has a good idea?

So, for any set of texture environments there is an ordering of operations and an assignment of registers that will work. Once upon a time I wrote a Python script that implemented a simple algorithm to do this. I'll have to see if I can dig it up.

The algorithm works in two passes. The first pass identifies any texture stages and texture reads, if any, that do not contribute to the final result. I'm going to use the notation T# for a texture read and P for the previous result. If the texture environment is { {T0 + T1} {T3 - T2} }, then T0, T1, and the result of adding them don't contribute to the final result. You can omit those stages entirely and freely use those registers as temporaries.

The second pass assigns registers. Each T# gets assigned the next R#, in order. If T0, T1, and T4 contribute to the final result, they get assigned R0, R1, and R2. Next, each P gets assigned an available register. A register is available if it's either unassigned or its value will not be read again. At any point there is *always* an available register. I think this is mathematically provable, but it's way beyond my patience to do so. :)

Here are a couple of examples. I have left out the operations for clarity. I'm also going to simplify a bit. I assume 3 textures, 3 registers, 3 stages, and 2 reads per stage.
Start:    {T0, T2}, {P , T0}, {T1, P }
Pass 1:   {T0, T2}, {P , T0}, {T1, P }
Pass 2.1: {R0, R2}, {P , R0}, {R1, P }
Pass 2.2: {R0, R2}, {R2, R0}, {R1, R0}

Start:    {T0, T2}, {T1, T0}, {T1, P }
Pass 1:             {T1, T0}, {T1, P }
Pass 2.1:           {R1, R0}, {R1, P }
Pass 2.2:           {R1, R0}, {R1, R0}

Working through this, I noticed something that I hadn't noticed before. This technique only works if each operation cannot access the entire register set. I first did it with 3 reads per stage, and I very quickly came up with some impossible examples. :) 3 reads w/6 registers will still work.

The nice thing about this algorithm is that it not only works, but it also eliminates "dead code" and unused textures. I don't know about the former, but the latter can certainly improve the performance of ill-written code. In addition, this same algorithm could be used to optimize ATI_fragment_program code. It should also make it possible to implement NV_texture_env_combine4, which is used by a lot more programs than ATI_texture_env_combine3. In both these cases you need to expand the notation to have multiple P values.

Other optimizations are possible, but I never explored them. Most of the ones that I could think of are probably unlikely in practice. Doing things like replacing {T1 + T2}, {P + P}, {P + T3} with {T1 + T2}*2, {P + T3}, or replacing {T1 * T2}, {P + T0} with {T1 * T2 + T0} are possible, but probably not worth the effort.

I think the right way to actually implement this in the driver is to convert the texture env (be it ARB_texture_env_combine / ATI_texture_env_combine3 or NV_texture_env_combine4) into an ATI_fragment_program and optimize that. Doing it that way effectively kills two birds with one stone. We can get away with that here because the texture env will only ever require one pass. One nice thing about doing it that way is that you can write an application that converts texture env scripts to ATI_fragment_programs.
You can compare the direct implementation of the texture env with the generated ATI_fragment_program. That should be a *lot* easier to debug than doing it in the driver code!

It's also worth noting that a similar technique can be applied in the i830 driver to implement ATI_texture_env_combine3. The i830 implements *most* of the required instructions. The unavailable instructions can be implemented by simpler operations (e.g., {T0*T1-T2} becomes {T0*T1} {P-T2}). Adding the optimization pass, especially if it *did* the optimizations that I said were "probably not worth the effort", would reduce the chances of needing a fallback. An env like {T0*T1-T2} {P+T3} {P*C} {P+T0} would be optimized to {T0*T1+T3} {P-T2} {P*C+T0}.

If you don't think you want to tackle this now, I'll gather up my Python script and all my notes on the subject and file an enhancement bug. That way none of the information will get lost / forgotten.
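[The two-pass allocation described above can be sketched in Python roughly as follows. This is a reconstruction from the description in this thread, not Ian's original script; it uses his simplified notation (stage sources only, operations omitted), and the register-file size is a parameter.]

```python
def allocate(stages, num_regs):
    """Rewrite texture-env stage sources in terms of R# registers.

    stages: one list of sources per env stage; "Tn" reads texture n,
    "P" reads the previous stage's result (in the first stage, "P"
    would be the incoming fragment color and is left alone).
    Raises RuntimeError when no register is free (driver fallback).
    """
    # Pass 1: drop stages whose result is never consumed.  A stage is
    # live only if the following stage is live and actually reads P.
    live = [False] * len(stages)
    live[-1] = True
    for i in range(len(stages) - 2, -1, -1):
        live[i] = live[i + 1] and "P" in stages[i + 1]
    stages = [s for i, s in enumerate(stages) if live[i]]

    # Pass 2.1: surviving textures get R0, R1, ... in index order.
    texs = sorted({a for s in stages for a in s if a != "P"},
                  key=lambda t: int(t[1:]))
    if len(texs) > num_regs:
        raise RuntimeError("more live textures than registers")
    rmap = {t: "R%d" % i for i, t in enumerate(texs)}
    out = [[rmap.get(a, "P") for a in s] for s in stages]

    # Pass 2.2: each stage's result is written to a register that is
    # unassigned or whose value is never read again; the next stage's
    # P then reads that register.
    regs = ["R%d" % i for i in range(num_regs)]
    for i in range(len(out) - 1):
        future = {a for s in out[i + 1:] for a in s if a != "P"}
        free = [r for r in regs if r not in future]
        if not free:
            raise RuntimeError("no free register -> raster fallback")
        out[i + 1] = [free[0] if a == "P" else a for a in out[i + 1]]
    return out


# Ian's first example: {T0,T2} {P,T0} {T1,P} with 3 registers
print(allocate([["T0", "T2"], ["P", "T0"], ["T1", "P"]], 3))
# -> [['R0', 'R2'], ['R2', 'R0'], ['R1', 'R0']]
```

This reproduces both worked examples above, and Roland's pathological six-stage case later in the thread does make it raise the fallback error, as he predicts.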
From: Roland S. <rsc...@hi...> - 2005-07-22 11:36:35
Ian Romanick wrote:
>> I'm actually wondering how ATI solved that problem in their driver,
>> I couldn't see an easy way out to avoid the fallback - even using
>> the 2 additional tex env stages or the second phase of the fragment
>> pipeline isn't going to fix the issue I think. Maybe someone else
>> has a good idea?
>
> So, for any set of texture environments there is an ordering of
> operations and an assignment of registers that will work. Once upon
> a time I wrote a python script that implemented a simple algorithm
> to do this. I'll have to see if I can dig it up.
>
> The algorithm works in two passes. The first pass identifies any
> texture stages and texture reads, if any, that do not contribute to
> the final result. I'm going to use the notation T# for a texture
> read and P for the previous result. If the texture environment is
> { {T0 + T1} {T3 - T2} }, then T0, T1, and the result of adding them
> don't contribute to the final result. You can omit those stages
> entirely and freely use those registers as temporaries.

Yes, that's what my code does too: it uses the regs which contain unneeded textures as temporaries (it does not, however, eliminate the texture lookups nor the env stages; I didn't want to mess with that state for now, but it could be done).

> The second pass assigns registers. Each T# gets assigned the next
> R#, in order. If T0, T1, and T4 contribute to the final result, they
> get assigned R0, R1, and R2. Next, each P gets assigned an
> available register. A register is available if it's either unassigned
> or its value will not be read again. At any point, there is
> *always* an available register. I think this is mathematically
> provable, but it's way beyond my patience to do so. :)

Ah, you also reorder the texture assignments. This one I didn't look at; it looked like too much work (and it shouldn't make a difference).

> Here are a couple examples. I have left out the operations for
> clarity. I'm also going to simplify a bit.
> I assume 3 textures, 3 registers, 3 stages, and 2 reads per stage.
>
> Start:    {T0, T2}, {P , T0}, {T1, P }
> Pass 1:   {T0, T2}, {P , T0}, {T1, P }
> Pass 2.1: {R0, R2}, {P , R0}, {R1, P }
> Pass 2.2: {R0, R2}, {R2, R0}, {R1, R0}
>
> Start:    {T0, T2}, {T1, T0}, {T1, P }
> Pass 1:             {T1, T0}, {T1, P }
> Pass 2.1:           {R1, R0}, {R1, P }
> Pass 2.2:           {R1, R0}, {R1, R0}
>
> Working through this, I noticed something that I hadn't noticed
> before. This technique only works if each operation cannot access the
> entire register set. I first did it with 3 reads per stage, and I
> very quickly came up with some impossible examples. :) 3 reads w/6
> registers will still work.

But you can have 6 reads per stage (thanks to different alpha/rgb sources), i.e. the whole register set. And here's a counterexample to your theory, even assuming only 3 reads per stage :-):

{T4, T5, P} {T2, T3, P} {T0, T1, P} {T4, T5, P} {T2, T3, P} {T0, T1, P}

How do you want to optimize that? In the first two stages you can't reassign any reg, as all 6 texture sampling results are needed again (that is, unless you analyze the whole "fragment program" and produce a shorter, mathematically equivalent one - but with all the different operations possible, plus scaling etc., this may not be possible).

> The nice thing about this algorithm is that it not only works, but it
> eliminates "dead code" and unused textures. I don't know about the
> former, but the latter can certainly improve the performance of
> ill-written code. In addition, this same algorithm could be used to
> optimize ATI_fragment_program code. It should also make it possible
> to implement NV_texture_env_combine4, which is used by a lot more
> programs than ATI_texture_env_combine3. In both these cases you need
> to expand the notation to have multiple P values.

I thought about those unused textures too; is it worth bothering to do performance optimizations for crappy apps? Is such code even in widespread use?
> Other optimizations are possible, but I never explored them. Most of
> the ones that I could think of are probably unlikely in practice.
> Doing things like replacing {T1 + T2}, {P + P}, {P + T3} with
> {T1 + T2}*2, {P + T3}, or replacing {T1 * T2}, {P + T0} with
> {T1 * T2 + T0} are possible, but probably not worth the effort.

That gets close to the complexity of optimizing compilers, not my strength :-). But you're probably right; the env stages are likely executed faster than the texture lookups, I suppose (though I have no idea how fast exactly they are executed - something like 1 clock per stage?). In contrast to optimizing away unused textures, though, there should be more opportunity for such optimizations.

> I think the right way to actually implement this in the driver is to
> convert texture env (be it ARB_texture_env_combine /
> ATI_texture_env_combine3 or NV_texture_env_combine4) into an
> ATI_fragment_program and optimize that. Doing it that way
> effectively kills two birds with one stone. We can get away with
> that here because the texture env will only ever require one pass.
> One nice thing about doing it that way is you can write an
> application that converts texture env scripts to
> ATI_fragment_programs. You can compare the direct implementation of
> the texture env with the generated ATI_fragment_program. That
> should be a *lot* easier to debug than doing it in the driver code!
>
> It's also worth noting that a similar technique can be applied in the
> i830 driver to implement ATI_texture_env_combine3. The i830
> implements *most* of the required instructions. The unavailable
> instructions can be implemented by simpler operations (e.g.,
> {T0*T1-T2} becomes {T0*T1} {P-T2}). Adding the optimization pass,
> especially if it *did* the optimizations that I said were "probably
> not worth the effort", would reduce the chances of needing a
> fallback. An env like {T0*T1-T2} {P+T3} {P*C} {P+T0} would be
> optimized to {T0*T1+T3} {P-T2} {P*C+T0}.
Looks very nice, but quite complicated :-(.

> If you don't think you want to tackle this now, I'll gather up my
> python script and all my notes on the subject and file an enhancement
> bug. That way none of the information will get lost / forgotten.

Yes, that would be nice.

Roland
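[The i830 instruction-splitting discussed above (turning an unsupported {T0*T1-T2} into {T0*T1} {P-T2}) can be sketched as a simple lowering pass. The op names and the set of native ops below are made up for illustration; they are not the real i830 instruction set.]

```python
# Ops assumed native for this sketch; the real i830 set differs.
NATIVE = {"MUL", "ADD", "SUB", "MULADD"}

def lower(stages):
    """Split unsupported combiner ops into supported two-stage forms.

    Each stage is (op, args).  Only MULSUB (a*b - c) is handled here,
    by emitting a MUL followed by a SUB that reads the previous result.
    """
    out = []
    for op, args in stages:
        if op == "MULSUB" and "MULSUB" not in NATIVE:
            a, b, c = args
            out.append(("MUL", [a, b]))    # P = a * b
            out.append(("SUB", ["P", c]))  # P = P - c
        else:
            out.append((op, args))
    return out


# {T0*T1-T2} {P+T3}  ->  {T0*T1} {P-T2} {P+T3}
print(lower([("MULSUB", ["T0", "T1", "T2"]), ("ADD", ["P", "T3"])]))
```

A real implementation would also need the merging pass going the other direction (folding {T0*T1} {P+T3} back into a native MULADD) to keep the stage count down, as in Ian's {T0*T1+T3} {P-T2} {P*C+T0} example.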
From: Roland S. <rsc...@hi...> - 2005-07-22 15:18:38
Roland Scheidegger wrote:
> but you can have 6 reads per stage (thanks to different alpha/rgb
> sources), e.g. the whole register set.

Ah, scratch that, wrote too fast. This is of course not an issue, since you can address the alpha and rgb regs separately too. I just didn't do it due to the added complexity, and because I'm not sure it would help that much in practice to avoid the fallback - how often are different sources for rgb and alpha used?

Roland
From: Patrick M. <pmc...@do...> - 2005-07-22 17:17:32
On Friday 22 July 2005 08:20 am, Roland Scheidegger wrote:
> Roland Scheidegger wrote:
> > but you can have 6 reads per stage (thanks to different alpha/rgb
> > sources), e.g. the whole register set.
>
> Ah scratch that, wrote too fast. This is of course not an issue, since
> you can address the alpha and rgb regs separately too. I just didn't
> do it due to the added complexity, and because I'm not sure it would
> help that much in practice to avoid the fallback - how often are
> different sources for rgb and alpha used?

I'd do it if it could cause a cool effect...

-- 
Patrick "Diablo-D3" McFarland || pmc...@do...
"Computer games don't affect kids; I mean if Pac-Man affected us as kids, we'd all be running around in darkened rooms, munching magic pills and listening to repetitive electronic music." -- Kristian Wilson, Nintendo, Inc, 1989