Re: GL_ARB_texture_env_crossbar for r200

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Roland Scheidegger wrote:

> I'm actually wondering how ATI solved that problem in their driver, I
> couldn't see an easy way out to avoid the fallback - even using the 2
> additional tex env stages or the second phase of the fragment pipeline
> isn't going to fix the issue I think. Maybe someone else has a good idea?

So, for any set of texture environments there is an ordering of
operations and an assignment of registers that will work.  Once upon a
time I wrote a python script that implemented a simple algorithm to do
this.  I'll have to see if I can dig it up.

The algorithm works in two passes.  The first pass identifies texture
stages and texture reads that do not contribute to the final result.
I'm going to use the notation T# for a texture read and P for the
previous result.  If the texture environment is { {T0 + T1} {T3 - T2} },
the second stage never reads P, so T0, T1, and the result of adding them
don't contribute to the final result.  You can omit those stages
entirely and freely use those registers as temporaries.
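
A minimal sketch of the first pass, not the original script.  It
assumes each stage is a tuple of operand names ("T0", "P", ...), that P
always means the result of the immediately preceding stage, and that
the last stage produces the final result.  The function names are
hypothetical.

```python
def prune_dead_stages(stages):
    """Keep only the trailing run of stages that feed the final result."""
    live = []
    needed = True  # the last stage always produces the final result
    for stage in reversed(stages):
        if not needed:
            break  # this stage and everything before it is dead
        live.append(stage)
        # The preceding stage matters only if this stage reads P.
        needed = "P" in stage
    live.reverse()
    return live


def live_textures(stages):
    """Texture reads that survive, sorted by texture unit number."""
    names = {op for stage in stages for op in stage if op.startswith("T")}
    return sorted(names, key=lambda name: int(name[1:]))


env = [("T0", "T1"), ("T3", "T2")]
print(prune_dead_stages(env))
print(live_textures(prune_dead_stages(env)))
```

Walking the stages backwards is enough here because P can only refer to
the stage directly before it.  With multiple P values, as in a fragment
program, a full liveness analysis would be needed instead.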

The second pass assigns registers.  Each T# gets assigned the next R#,
in order.  If T0, T1, and T4 contribute to the final result, they get
assigned R0, R1, and R2.  Next, each P gets assigned an available
register.  A register is available if it's either unassigned or its value
will not be read again.  At any point, there is *always* an available
register.  I think this is mathematically provable, but it's way beyond
my patience to do so. :)
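
A minimal sketch of the second pass, again not the original script.  It
assumes the dead stages have already been removed, uses the same tuple
notation for stages, and picks the lowest-numbered available register
(the text above does not say which one to prefer).  A P in the very
first stage is left untouched, on the assumption that it names the
incoming color rather than a register.  The name assign_registers and
the num_regs parameter are hypothetical.

```python
def assign_registers(stages, num_regs=3):
    """Map T# reads to R# and give every P a concrete register."""
    # Pass 2.1: each live texture gets the next register, in order.
    textures = sorted(
        {op for stage in stages for op in stage if op.startswith("T")},
        key=lambda name: int(name[1:]),
    )
    if len(textures) > num_regs:
        raise ValueError("more live textures than registers")
    tex_reg = {name: index for index, name in enumerate(textures)}

    result = []
    prev_dest = None  # register holding the previous stage's result
    for i, stage in enumerate(stages):
        mapped = []
        for op in stage:
            if op == "P" and prev_dest is not None:
                mapped.append(f"R{prev_dest}")
            elif op in tex_reg:
                mapped.append(f"R{tex_reg[op]}")
            else:
                mapped.append(op)  # e.g. P in the first stage
        result.append(tuple(mapped))

        # Pass 2.2: choose where this stage's result goes, if it is read.
        prev_dest = None
        if i + 1 < len(stages) and "P" in stages[i + 1]:
            # Registers whose texture value a later stage still reads.
            still_needed = {
                tex_reg[op]
                for later in stages[i + 1:]
                for op in later
                if op in tex_reg
            }
            free = [r for r in range(num_regs) if r not in still_needed]
            if not free:
                raise RuntimeError(f"no available register after stage {i}")
            prev_dest = free[0]
    return result


print(assign_registers([("T0", "T2"), ("P", "T0"), ("T1", "P")]))
```

A register that holds an earlier P never shows up in still_needed, so
it becomes available again as soon as the following stage has consumed
it.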

Here are a couple of examples.  I have left out the operations for
clarity.  I'm also going to simplify a bit.  I assume 3 textures, 3
registers, 3 stages, and 2 reads per stage.

Start:    {T0, T2}, {P , T0}, {T1, P }
Pass 1:   {T0, T2}, {P , T0}, {T1, P }
Pass 2.1: {R0, R2}, {P , R0}, {R1, P }
Pass 2.2: {R0, R2}, {R2, R0}, {R1, R0}

Start:    {T0, T2}, {T1, T0}, {T1, P }
Pass 1:   {T1, T0}, {T1, P }
Pass 2.1: {R1, R0}, {R1, P }
Pass 2.2: {R1, R0}, {R1, R0}

Working through this, I noticed something that I hadn't noticed before.
This technique only works if each operation cannot access the entire
register set.  I first did it with 3 reads per stage, and I very quickly
came up with some impossible examples. :)  3 reads with 6 registers will
still work.

The nice thing about this algorithm is that it not only works, but it
eliminates "dead code" and unused textures.  I don't know about the
former, but the latter can certainly improve the performance of poorly
written code.  In addition, this same algorithm could be used to
optimize ATI_fragment_program code.  It should also make it possible to
implement NV_texture_env_combine4, which is used by a lot more programs
than ATI_texture_env_combine3.  In both these cases you need to expand
the notation to have multiple P values.

Other optimizations are possible, but I never explored them.  Most of
the ones that I could think of are probably unlikely to come up in
practice.  Replacing {T1 + T2}, {P + P}, {P + T3} with {T1 + T2}*2,
{P + T3}, or replacing {T1 * T2}, {P + T0} with {T1 * T2 + T0}, is
possible but probably not worth the effort.

I think the right way to actually implement this in the driver is to
convert texture env (be it ARB_texture_env_combine /
ATI_texture_env_combine3 or NV_texture_env_combine4) into an
ATI_fragment_program and optimize that.  Doing it that way effectively
kills two birds with one stone.  We can get away with that here because
the texture env will only ever require one pass.  One nice thing about
doing it that way is you can write an application that converts texture
env scripts to ATI_fragment_programs.  You can compare the direct
implementation of the texture env with the generated
ATI_fragment_program.  That should be a *lot* easier to debug than doing
it in the driver code!

It's also worth noting that a similar technique can be applied in the
i830 driver to implement ATI_texture_env_combine3.  The i830 implements
*most* of the required instructions.  The unavailable instructions can
be implemented by simpler operations (e.g., {T0*T1-T2} becomes {T0*T1}
{P-T2}).  Adding the optimization pass, especially if it *did* the
optimizations that I said were "probably not worth the effort", would
reduce the chances of needing a fallback.  An env like {T0*T1-T2} {P+T3}
{P*C} {P+T0} would be optimized to {T0*T1+T3} {P-T2} {P*C+T0}.
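
As a quick sanity check, the four-stage env and its three-stage
replacement can be evaluated side by side.  This is only a sketch: each
stage is modelled as a hypothetical Python function of (P, textures,
constant), and the per-stage clamp to [0, 1] that texture env combine
applies is deliberately ignored.  With clamping, reordering the add and
the subtract can change the result when an intermediate value leaves
that range, so a real implementation would have to account for it.

```python
def run_env(stages, tex, const, primary=0.0):
    """Feed each stage's result to the next one as P (no clamping)."""
    prev = primary
    for stage in stages:
        prev = stage(prev, tex, const)
    return prev


# {T0*T1-T2} {P+T3} {P*C} {P+T0}
original = [
    lambda p, t, c: t[0] * t[1] - t[2],
    lambda p, t, c: p + t[3],
    lambda p, t, c: p * c,
    lambda p, t, c: p + t[0],
]

# {T0*T1+T3} {P-T2} {P*C+T0}
optimized = [
    lambda p, t, c: t[0] * t[1] + t[3],
    lambda p, t, c: p - t[2],
    lambda p, t, c: p * c + t[0],
]

# Dyadic sample values keep the float arithmetic exact.
tex = [0.5, 0.5, 0.125, 0.25]
print(run_env(original, tex, 0.5), run_env(optimized, tex, 0.5))
```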

If you don't think you want to tackle this now, I'll gather up my python
script and all my notes on the subject and file an enhancement bug.
That way none of the information will get lost / forgotten.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFC4JM7X1gOwKyEAw8RAkGqAJkBbllTflRuCtOiV8PxwFtDiJMGLQCfaEhg
ra2W8jQQae1odI95g/RwQ5Y=
=cGeR
-----END PGP SIGNATURE-----