From: F. <j_r...@ya...> - 2003-04-04 01:38:09
Now that, thanks to Brian, the textures are pretty much taken care of, I'm moving into the TNL module for the C++ framework.

First, some definitions. "TnL" here is defined as the object [or module] that handles all the geometric (vertex) data (as opposed to the context, which handles the state). This data is supposed to be transformed, clipped, lit and rasterized, but not all of these tasks are performed by the TnL object itself - actually they are dispatched to the hardware as much and as soon as possible.

As a special note, the TnL receives vertices but, since usually many of the vertex properties (color, normals, ...) don't change from one vertex to the next, in OpenGL you have one API entry point for each property (glVertex, glColor, etc.). Still, it's _whole_ vertices that it's receiving.

My proposal is to model the TnL module as a producer-consumer. The producer exposes an API similar to OpenGL's, updates a "current vertex", and produces vertices from that current vertex. The consumer receives those vertices. I.e., something like this:

    class Vertex; // Abstract vertex

    class TnLConsumer {
        void consume(Vertex *vertex);
    };

    class TnLProducer {
        Vertex current;
        TnLConsumer *consumer;

        TnLProducer(TnLConsumer *_consumer) {
            consumer = _consumer;
        }

        void Color3(float r, float g, float b) {
            current.r = r; current.g = g; current.b = b;
        }

        void Coord3(float x, float y, float z) {
            current.x = x; current.y = y; current.z = z;
            produce();
        }

        void produce() {
            consumer->consume(&current);
        }
    };

What's special about this is that usually there isn't just a single producer for a given driver, but potentially a myriad of them (each specialized for a set of hardware vertex formats or a software vertex format). The same goes for the consumer. The appropriate producer and consumer are chosen by the context during a glBegin() call.

The reason to separate the consumer and producer, rather than merge them, is that when using call lists the producer/consumer won't be sending the vertices to the card but to memory instead. This is accomplished by using another consumer/producer which writes/reads the hardware vertices to/from memory.

This can be implemented in C++ without touching the current Mesa code, by wrapping the current TnL code. But if the idea is pleasing we could move the C TnL interface to this model. This would allow code for direct hardware vertex generation (as is done in the Radeon embedded driver) to coexist nicely with code that needs to do some of the TCL operations in software.

José Fonseca
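A minimal compilable sketch of the proposal above (the attribute types, access specifiers, and the memory-writing consumer used for the call-list case are all assumptions; the original is only pseudocode):

```cpp
#include <cassert>
#include <vector>

// Illustrative names; not actual Mesa code.
struct Vertex {
    float x, y, z;
    float r, g, b;
};

class TnLConsumer {
public:
    virtual ~TnLConsumer() {}
    virtual void consume(const Vertex *v) = 0;
};

// A consumer that stores vertices in memory, as a call-list consumer
// would, instead of sending them to the card.
class MemoryConsumer : public TnLConsumer {
public:
    std::vector<Vertex> stored;
    void consume(const Vertex *v) { stored.push_back(*v); }
};

class TnLProducer {
    Vertex current;
    TnLConsumer *consumer;
public:
    explicit TnLProducer(TnLConsumer *c) : current(), consumer(c) {}
    // Property calls update the current vertex; Coord3 completes it.
    void Color3(float r, float g, float b) {
        current.r = r; current.g = g; current.b = b;
    }
    void Coord3(float x, float y, float z) {
        current.x = x; current.y = y; current.z = z;
        produce();
    }
private:
    void produce() { consumer->consume(&current); }
};
```

Swapping in a different TnLConsumer is all it takes to redirect vertices from the card to memory, which is the point of keeping producer and consumer separate.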
From: Brian P. <br...@tu...> - 2003-04-04 15:53:43
José Fonseca wrote:
> Now that, thanks to Brian, the textures are pretty much taken care of,
> I'm moving into the TNL module for the C++ framework.

[sample code cut]

> This can be implemented in C++ without touching the current Mesa code, by
> wrapping the current TnL code. But if the idea is pleasing we could
> move the C TnL interface to this model. This would allow code for direct
> hardware vertex generation (as is done in the Radeon embedded driver) to
> coexist nicely with code that needs to do some of the TCL operations in
> software.

In general, this sounds reasonable but you also have to consider performance.

The glVertex, Color, TexCoord, etc. commands have to be simple and fast. As it is now, glColor4f (for example) (when implemented in x86 assembly) is just a jump into _tnl_Color4f() which stuffs the color into the immediate struct and returns. Something similar is done in the R200 driver.

If the implementation of _tnl_Color4f() involves a call to producer->Color4f() we'd lose some performance.

Nowadays, vertex arrays are the path to use if you really care about performance, of course, but a lot of apps still use the regular per-vertex GL functions.

-Brian
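The fast path Brian describes can be sketched like this: the dispatch-table slot for glColor4f points at a plain function that stuffs the color into a current-vertex struct and returns, with no producer object in between. The names below are illustrative, not the actual Mesa symbols:

```cpp
#include <cassert>

// The "immediate struct" the color is stuffed into (illustrative).
struct CurrentVertex {
    float r, g, b, a;
} current;

// What the dispatch slot jumps to: one store per component, then return.
static void fast_Color4f(float r, float g, float b, float a) {
    current.r = r; current.g = g; current.b = b; current.a = a;
}

// The dispatch-table slot itself; the API entry point (or an assembly
// stub) is just an indirect jump through this pointer.
typedef void (*Color4fFunc)(float, float, float, float);
Color4fFunc dispatch_Color4f = fast_Color4f;
```

Adding a producer->Color4f() virtual call on top of this path is the extra cost being discussed.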
From: F. <jrf...@tu...> - 2003-04-04 17:20:46
On Fri, Apr 04, 2003 at 08:48:35AM -0700, Brian Paul wrote:
> The glVertex, Color, TexCoord, etc. commands have to be simple and fast. As
> it is now, glColor4f (for example) (when implemented in x86 assembly) is
> just a jump into _tnl_Color4f() which stuffs the color into the immediate
> struct and returns. Something similar is done in the R200 driver.
>
> If the implementation of _tnl_Color4f() involves a call to
> producer->Color4f() we'd lose some performance.

I know, but my objective is to design a good object interface on which all drivers may fit and reuse code. When a driver gets to the point where the producer->Color4f() calls are the main performance bottleneck (!?) the developer is free to write a tailored version of TnLProducer that eliminates that extra call:

    class TnLProducerFast {
        Vertex current;
        TnLConsumer *consumer;

        TnLProducerFast(TnLConsumer *_consumer) {
            consumer = _consumer;
        }

        void activate() {
            _glapi_setapi(GL_COLOR3f, _Color3f);
            ...
        }

        static void _Color3f(float r, float g, float b) {
            TnLProducerFast *self = GET_THIS_PTR_FROM_CURRENT_CTX();
            self->current.r = r;
            self->current.g = g;
            self->current.b = b;
        }
    };

We can even generate this TnLProducerFast automatically from the original TnLProducer with a template, i.e.,

    template <class T>
    class TnLProducerTmpl {
        T tnl;

        void activate() {
            _glapi_setapi(GL_COLOR3f, _Color3f);
            ...
        }

        static void _Color3f(float r, float g, float b) {
            TnLProducerTmpl *self = GET_THIS_PTR_FROM_CURRENT_CTX();
            self->tnl.Color3f(r, g, b); // This call is eliminated if
                                        // T::Color3f is inlined
        }
    };

    typedef TnLProducerTmpl< TnLProducer > TnLProducerFast;

But this is all of _very_ _little_ importance compared with the ability to _write_ a full driver fast, which is what a well designed OOP interface gives you. As I said here several times, this kind of low-level optimization consumes too much development time, with the result that higher-level optimizations (usually with much more impact on performance) are never attempted.

> Nowadays, vertex arrays are the path to use if you really care about
> performance, of course, but a lot of apps still use the regular
> per-vertex GL functions.

Now that you mention vertex arrays: for those the producer would be different, but the consumer would be the same.

José Fonseca
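A compilable version of the wrapper-template idea above, under stated assumptions: a global current-context pointer stands in for the GET_THIS_PTR_FROM_CURRENT_CTX() macro, and SimpleProducer is an invented stand-in for TnLProducer:

```cpp
#include <cassert>

// Invented stand-in for the wrapped producer; its inline Color3f is
// what the compiler can fold into the static entry point below.
struct SimpleProducer {
    float r, g, b;
    void Color3f(float r_, float g_, float b_) { r = r_; g = g_; b = b_; }
};

template <class T>
class TnLProducerTmpl {
public:
    T tnl;
    static TnLProducerTmpl *current_ctx;   // set by activate()

    void activate() { current_ctx = this; }

    // The function actually placed in the dispatch table; the forwarding
    // call disappears when T::Color3f is inlined.
    static void _Color3f(float r, float g, float b) {
        current_ctx->tnl.Color3f(r, g, b);
    }
};

template <class T>
TnLProducerTmpl<T> *TnLProducerTmpl<T>::current_ctx = 0;

typedef TnLProducerTmpl<SimpleProducer> TnLProducerFast;
```

The static member function has a plain function type, so it can go straight into a C dispatch table, which is what makes this pattern fit the existing glapi machinery.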
From: Ian R. <id...@us...> - 2003-04-04 18:08:46
José Fonseca wrote:
> On Fri, Apr 04, 2003 at 08:48:35AM -0700, Brian Paul wrote:
>> If the implementation of _tnl_Color4f() involves a call to
>> producer->Color4f() we'd lose some performance.
>
> I know, but my objective is to design a good object interface on which
> all drivers may fit and reuse code. When a driver gets to the point
> where the producer->Color4f() calls are the main performance bottleneck
> (!?) the developer is free to write a tailored version of TnLProducer
> that eliminates that extra call:

Right now people use things like Viewperf to make systems purchase decisions. Unless the graphics hardware and the rest of the system are very mismatched, the immediate API already has an impact on performance in those benchmarks.

The performance of the immediate API *is* important to real applications. Why do you think Sun came up with the SUN_vertex extension? To reduce the overhead of the immediate API, of course. :)

[sample code cut]

> But this is all of _very_ _little_ importance compared with the ability
> to _write_ a full driver fast, which is what a well designed OOP
> interface gives you. As I said here several times, this kind of
> low-level optimization consumes too much development time, with the
> result that higher-level optimizations (usually with much more impact
> on performance) are never attempted.

In principle, I think the producer/consumer idea is good. Why not implement known optimizations in it from the start? We already have *working code* to build formatted vertex data (see the radeon & r200 drivers), why not build the object model from there? Each concrete producer class would have an associated vertex format. On creation, it would fill in a table of functions to put data in its vertex buffer. This could mean pointers to generic C functions, or it could mean dynamically generating code from assembly stubs.

The idea is that the functions from this table could be put directly in the dispatch table. This is, IMHO, critically important.

The various vertex functions then just need to call the object's produce method. This all boils down to putting a C++ face on a technique that has been demonstrated to work.

I do have one question. Do we really want to invoke the producer on every vertex immediately? In the radeon / r200 drivers this is just to copy the whole vertex to a DMA buffer. Why not generate the data directly where it needs to go? I know that if the vertex format changes before the vertex is complete we need to copy out of the temporary buffer into the GL state vector, but that doesn't seem like the common case. At the very least, some guys at Intel think generating data directly in DMA buffers is the way to go:

http://www.intel.com/technology/itj/Q21999/ARTICLES/art_4.htm

I guess my point is that we *can* have our cake and eat it too. We can have a nice object model and have "classic" low-level optimizations. The benefit of doing those optimizations at the level of the object model is that they only need to be done once for a given vertex format. Reusing optimizations sounds like a big win to me! :)
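The per-format producer suggested above can be sketched roughly as follows; the particular format (offsets, stride) and the std::vector standing in for a DMA buffer are invented for illustration, not taken from the radeon/r200 code:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// A producer bound to one hardware vertex format: it knows where each
// attribute lives in the formatted vertex and emits completed vertices
// straight into a (simulated) DMA buffer.
struct FormatProducer {
    std::vector<unsigned char> dma;     // stands in for a DMA buffer
    size_t stride;
    size_t coord_offset, color_offset;
    unsigned char scratch[32];          // the vertex being built

    FormatProducer(size_t coord_off, size_t color_off, size_t s)
        : stride(s), coord_offset(coord_off), color_offset(color_off) {
        std::memset(scratch, 0, sizeof scratch);
    }

    void Color3ub(unsigned char r, unsigned char g, unsigned char b) {
        unsigned char c[3] = { r, g, b };
        std::memcpy(scratch + color_offset, c, sizeof c);
    }
    void Coord3(float x, float y, float z) {
        float v[3] = { x, y, z };
        std::memcpy(scratch + coord_offset, v, sizeof v);
        produce();
    }
    // Append the formatted vertex; real code would write at a cursor
    // into mapped DMA memory rather than grow a vector.
    void produce() {
        dma.insert(dma.end(), scratch, scratch + stride);
    }
};
```

The per-attribute writers here are plain member functions; the table-of-functions idea would populate the dispatch table with exactly this kind of offset-aware writer, one set per vertex format.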
From: F. <jrf...@tu...> - 2003-04-05 00:04:16
On Fri, Apr 04, 2003 at 10:08:36AM -0800, Ian Romanick wrote:
> In principle, I think the producer/consumer idea is good. Why not
> implement known optimizations in it from the start? [...]
>
> The various vertex functions then just need to call the object's produce
> method. This all boils down to putting a C++ face on a technique that
> has been demonstrated to work.

I hope that integration of assembly generation with C++ is feasible, but I see it as an implementation issue, regardless of the performance issues, which according to all who have replied aren't as negligible as I thought. The reason is that this kind of optimization is very dependent on the vertex formats and other hardware details, making it difficult to reuse the code - which is exactly what I want to avoid at this stage.

> I do have one question. Do we really want to invoke the producer on
> every vertex immediately? [...] At the very least, some guys at Intel
> think generating data directly in DMA buffers is the way to go:
>
> http://www.intel.com/technology/itj/Q21999/ARTICLES/art_4.htm

This is a very interesting read. Thanks for the pointer.

It's complicated to know the vertices' positions in the DMA buffer from the beginning, especially because of clipping, since vertices can be added or removed, but if I understood correctly, it's still better to work in the DMA memory and move the vertices around to avoid cache misses. But it can be very tricky: imagine that clipping generates vertices that no longer fit in the DMA buffer - what would be done then?

The thing I found most interesting is the issue of applying the TCL operations to all the vertices at once versus one vertex at a time. From previous discussions on this list it seems that nowadays most CPU performance is dictated by the cache, so it really seems the latter option is more efficient, but Mesa implements the former (they are even called "pipeline stages") and to change would mean a big overhaul of the TnL module.

> I guess my point is that we *can* have our cake and eat it too. We can
> have a nice object model and have "classic" low-level optimizations.
> The benefit of doing those optimizations at the level of the object model
> is that they only need to be done once for a given vertex format.
> Reusing optimizations sounds like a big win to me! :)

I hope so. But at this point I'll just try to design the objects so they allow both kinds of implementation.

Thanks for the feedback.

José Fonseca
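The two organizations being compared - stage-by-stage over an array versus one loop carrying each vertex through all stages - can be sketched as follows (the transform/light placeholders are invented; real stages do far more work):

```cpp
#include <cassert>
#include <cstddef>

struct V { float x, y; float lit; };

static void xform(V &v) { v.x *= 2.0f; v.y *= 2.0f; }
static void light(V &v) { v.lit = v.x + v.y; }

// Mesa-style pipeline stages: each stage walks the whole array, so
// every vertex is touched once per stage (poor cache reuse when the
// array exceeds the cache).
void run_staged(V *vs, size_t n) {
    for (size_t i = 0; i < n; i++) xform(vs[i]);
    for (size_t i = 0; i < n; i++) light(vs[i]);
}

// Single-vertex scheme: each vertex stays hot in cache while all
// stages run on it.
void run_per_vertex(V *vs, size_t n) {
    for (size_t i = 0; i < n; i++) { xform(vs[i]); light(vs[i]); }
}
```

Both produce identical results; the difference under discussion is purely in memory-access pattern.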
From: Brian P. <br...@tu...> - 2003-04-05 00:19:55
José Fonseca wrote:
> The thing I found most interesting is the issue of applying the TCL
> operations to all the vertices at once versus one vertex at a time.
> From previous discussions on this list it seems that nowadays most
> CPU performance is dictated by the cache, so it really seems the latter
> option is more efficient, but Mesa implements the former (they are even
> called "pipeline stages") and to change would mean a big overhaul of the
> TnL module.

On a historical note, the earliest versions of Mesa processed a single vertex at a time, instead of operating on arrays of vertices, stage by stage. Going to the latter was a big speed-up at the time.

Since the T&L code is a module, one could implement the single-vertex scheme as an alternate module. It would be an interesting experiment.

-Brian
From: F. <jrf...@tu...> - 2003-04-05 00:57:27
On Fri, Apr 04, 2003 at 05:14:54PM -0700, Brian Paul wrote:
> José Fonseca wrote:
> > The thing I found most interesting is the issue of applying the TCL
> > operations to all the vertices at once versus one vertex at a time. [...]
>
> On a historical note, the earliest versions of Mesa processed a single
> vertex at a time, instead of operating on arrays of vertices, stage by
> stage. Going to the latter was a big speed-up at the time.

Yes, and the use of SIMD instructions also favors that approach. Actually, in that article they chose to process 4 vertices at a time and not just one, surely because that's the number that fits in the MM registers.

I think the fact that CPUs got so much faster while buses didn't keep pace contributed to changing the picture, making non-cached memory access look awfully slow compared with everything else.

> Since the T&L code is a module, one could implement the single-vertex
> scheme as an alternate module. It would be an interesting experiment.

Indeed.

José Fonseca
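The 4-at-a-time idea can be sketched in scalar C++ like this (real code would use SIMD intrinsics or assembly; the 4-wide step is what maps onto the registers mentioned above, and the layout is assumed to be structure-of-arrays):

```cpp
#include <cassert>
#include <cstddef>

// Structure-of-arrays vertex storage (capacity invented for the sketch).
struct SoA {
    float x[8], y[8];
    size_t count;
};

void scale_x_by_2_in_batches_of_4(SoA &v) {
    // Four independent multiplies per iteration: exactly what one
    // 4-wide SIMD instruction would perform at once.
    for (size_t i = 0; i + 4 <= v.count; i += 4) {
        v.x[i+0] *= 2.0f; v.x[i+1] *= 2.0f;
        v.x[i+2] *= 2.0f; v.x[i+3] *= 2.0f;
    }
    // Scalar tail for counts that are not a multiple of 4.
    for (size_t i = v.count & ~size_t(3); i < v.count; i++)
        v.x[i] *= 2.0f;
}
```

Batching keeps the data for each operation contiguous, which is what makes the cache (and the SIMD registers) happy.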
From: Keith W. <ke...@tu...> - 2003-04-05 06:24:54
Brian Paul wrote:
> José Fonseca wrote:
>> The thing I found most interesting is the issue of applying the TCL
>> operations to all the vertices at once versus one vertex at a time. [...]
>
> On a historical note, the earliest versions of Mesa processed a single
> vertex at a time, instead of operating on arrays of vertices, stage by
> stage. Going to the latter was a big speed-up at the time.
>
> Since the T&L code is a module, one could implement the single-vertex
> scheme as an alternate module. It would be an interesting experiment.

For very simple modes, e.g. quake/quake2 where there is basically only clipping to do, this can work very well. The 3dfx 'minigl' driver worked this way, processing a single vertex at a time and then clipping each triangle once produced.

However, for a full GL pipeline it's not such a good proposition. One difficulty is dealing with fallbacks - if anyone tries to implement this you'll see what I mean: you want to 1) throw away intermediate data for performance reasons and 2) keep it hanging around in case you need to fall back.

Keith
From: Ian R. <id...@us...> - 2003-04-05 01:13:55
José Fonseca wrote:
> On Fri, Apr 04, 2003 at 10:08:36AM -0800, Ian Romanick wrote:
>> [...]
>
> I hope that integration of assembly generation with C++ is feasible, but
> I see it as an implementation issue [...]. The reason is that this kind
> of optimization is very dependent on the vertex formats and other
> hardware details, making it difficult to reuse the code - which is
> exactly what I want to avoid at this stage.

Realistically, either hardware or software uses either array-of-structures or structure-of-arrays. Most hardware uses the former. At that point it becomes a matter of: for a given state vector, what's the offset of an element in the structure? The assembly code in the radeon & r200 drivers handles this very nicely.

> It's complicated to know the vertices' positions in the DMA buffer from
> the beginning, especially because of clipping, since vertices can be
> added or removed [...]. But it can be very tricky: imagine that clipping
> generates vertices that no longer fit in the DMA buffer - what would be
> done then?

I think the "online driver model" from the paper only works if you have a single loop that does all the processing. Since Mesa uses a pipeline, it would be very tricky. Using the "online driver model" for a card w/HW TCL would be a different story.

> The thing I found most interesting is the issue of applying the TCL
> operations to all the vertices at once versus one vertex at a time. [...]
> Mesa implements the former (they are even called "pipeline stages") and
> to change would mean a big overhaul of the TnL module.

This would be very, very, very tricky. We'd basically need several different super-loops depending on the GL state vector. The super-loops would go in the pipeline at the same place where the hardware TCL functions go. If the super-loop could do all the processing, the following TCL stages would be skipped.

Before going down that road we'd want to sit down with oprofile and a bunch of applications to decide which sets of state we wanted to tune for. IMHO, we'd be better off spending our time writing a highly optimized just-in-time compiler for ARB_vertex_program. Then we could just write vertex programs for the different "important" state vectors and let the compiler generate the super-loop. Of course, there are still "issues" with vertex programs. :(
From: F. <jrf...@tu...> - 2003-04-05 01:40:33
On Fri, Apr 04, 2003 at 05:13:44PM -0800, Ian Romanick wrote:
> Realistically, either hardware or software uses either
> array-of-structures or structure-of-arrays. Most hardware uses the
> former. At that point it becomes a matter of, for a given state vector,
> what's the offset in the structure of an element? The assembly code in
> the radeon & r200 drivers handles this very nicely.

You're forgetting the data type. Perhaps recent hardware only uses
unsigned chars for color and floats for the rest, but on older hardware
(such as the Mach64) you have quite a mix of floating-point, integer,
and fixed-point data types...

> > The thing I found more interesting is the issue of applying the TCL
> > operations to all the vertices at once, or to one vertex at a time.
> > From previous discussions on this list it seems that nowadays most
> > CPU performance is dictated by the cache, so it really seems the
> > latter option is more efficient, but Mesa implements the former (the
> > stages are even called "pipeline stages") and changing that would
> > mean a big overhaul of the TnL module.
>
> This would be very, very, very tricky. We'd basically need several
> different super-loops depending on the GL state vector. The super-loops
> would go in the pipeline at the same place where the hardware TCL
> functions go. If the super-loop could do all the processing, the
> following TCL stages would be skipped.

This kind of thing is already done in the vertex buffer construction,
with templates whose sections are #ifdef'd out for each "super-loop"
according to the state it's meant for. If we used C++ for this, we could
use templates to instantiate classes for every possible combination of
vertex format and TCL operations.

> Before going down that road we'd want to sit down with oprofile and a
> bunch of applications to decide which sets of state we wanted to tune
> for. IMHO, we'd be better to spend our time writing a highly optimized
> just-in-time compiler for ARB_vertex_program. Then we could just write
> vertex programs for the different "important" state vectors and let the
> compiler generate the super-loop. Of course, there are still "issues"
> with vertex programs. :(

José Fonseca
|
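[José's suggestion above -- replacing the #ifdef'd template sections with C++ templates instantiated per vertex format -- can be sketched as follows. This is an illustration only: the two formats, their field layouts, and the packing functions are invented for the example and are not any driver's real layout.]

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Two hypothetical hardware vertex formats. Their field types differ
// (cf. the Mach64 mix of float/int/fixed), so each format supplies its
// own conversion from the canonical float attributes.
struct FmtFloatColor {
    struct Out { float x, y, z; float r, g, b; };
    static Out emit(float x, float y, float z, float r, float g, float b) {
        return Out{x, y, z, r, g, b};
    }
};

struct FmtPackedColor {
    struct Out { float x, y, z; std::uint32_t rgb; };
    static Out emit(float x, float y, float z, float r, float g, float b) {
        auto c = [](float f) { return std::uint32_t(f * 255.0f + 0.5f); };
        return Out{x, y, z, (c(r) << 16) | (c(g) << 8) | c(b)};
    }
};

// One "super-loop" template: the compiler instantiates a specialized
// loop per format, in place of #ifdef'd template sections. Input rows
// are (x, y, z, r, g, b) in floats.
template <class Fmt>
std::vector<typename Fmt::Out> build_vertices(const float (*in)[6],
                                              std::size_t n) {
    std::vector<typename Fmt::Out> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(Fmt::emit(in[i][0], in[i][1], in[i][2],
                                in[i][3], in[i][4], in[i][5]));
    return out;
}
```

Each TCL-operation combination would be a further template parameter in the same style; the point is that the combinatorial explosion is handled by the compiler rather than by the preprocessor.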
From: Allen A. <ak...@po...> - 2003-04-05 03:08:45
|
On Fri, Apr 04, 2003 at 05:13:44PM -0800, Ian Romanick wrote:
| .... IMHO, we'd be better to spend our time writing a highly optimized
| just-in-time compiler for ARB_vertex_program. Then we could just write
| vertex programs for the different "important" state vectors and let the
| compiler generate the super-loop. ...

Which brings to mind one of my favorite papers on dynamic compilation:

http://citeseer.nj.nec.com/massalin92synthesi.html

| ... Of course, there are still "issues"
| with vertex programs. :(

The ARB IP working group is trying to resolve those, or at least come up
with a way to resolve them. Nothing cast in concrete yet, unfortunately.

Allen
|
From: Keith W. <ke...@tu...> - 2003-04-05 06:28:53
|
Ian Romanick wrote:
> José Fonseca wrote:
>
>> On Fri, Apr 04, 2003 at 10:08:36AM -0800, Ian Romanick wrote:
>>
>>> In principle, I think the producer/consumer idea is good. Why not
>>> implement known optimizations in it from the start? We already have
>>> *working code* to build formatted vertex data (see the radeon & r200
>>> drivers), why not build the object model from there? Each concrete
>>> producer class would have an associated vertex format. On creation,
>>> it would fill in a table of functions to put data in its vertex
>>> buffer. This could mean pointers to generic C functions, or it could
>>> mean dynamically generating code from assembly stubs.
>>>
>>> The idea is that the functions from this table could be put directly
>>> in the dispatch table. This is, IMHO, critically important.
>>>
>>> The various vertex functions then just need to call the object's
>>> produce method. This all boils down to putting a C++ face on a
>>> technique that has been demonstrated to work.
>>
>> I hope that integrating assembly generation with C++ is feasible, but
>> I see it as an implementation issue, apart from the performance
>> issues, which according to all who have replied aren't as negligible
>> as I thought. The reason is that this kind of optimization is very
>> dependent on the vertex formats and other hardware details, making it
>> difficult to reuse the code -- which is exactly what I want to avoid
>> at this stage.
>
> Realistically, either hardware or software uses either
> array-of-structures or structure-of-arrays. Most hardware uses the
> former. At that point it becomes a matter of, for a given state vector,
> what's the offset in the structure of an element? The assembly code in
> the radeon & r200 drivers handles this very nicely.
>
>>> I do have one question. Do we really want to invoke the producer on
>>> every vertex immediately? In the radeon / r200 drivers this just
>>> copies the whole vertex to a DMA buffer. Why not generate the data
>>> directly where it needs to go? I know that if the vertex format
>>> changes before the vertex is complete we need to copy out of the
>>> temporary buffer into the GL state vector, but that doesn't seem like
>>> the common case. At the very least, some guys at Intel think
>>> generating data directly in DMA buffers is the way to go:
>>>
>>> http://www.intel.com/technology/itj/Q21999/ARTICLES/art_4.htm
>>
>> This is a very interesting read. Thanks for the pointer.
>>
>> It's complicated to know the vertices' positions in the DMA buffer
>> from the beginning, especially because of clipping, since vertices can
>> be added or removed. But if I understood correctly, it's still better
>> to do that in DMA memory and move the vertices around to avoid cache
>> misses. It can be very tricky, though: imagine that clipping generates
>> vertices that no longer fit in the DMA buffer -- what would be done
>> then?
>
> I think the "online driver model" from the paper only works if you have
> a single loop that does all the processing. Since Mesa uses a pipeline,
> it would be very tricky. Using the "online driver model" for a card
> w/HW TCL would be a different story.
>
>> The thing I found more interesting is the issue of applying the TCL
>> operations to all the vertices at once, or to one vertex at a time.
>> From previous discussions on this list it seems that nowadays most CPU
>> performance is dictated by the cache, so it really seems the latter
>> option is more efficient, but Mesa implements the former (the stages
>> are even called "pipeline stages") and changing that would mean a big
>> overhaul of the TnL module.
>
> This would be very, very, very tricky. We'd basically need several
> different super-loops depending on the GL state vector. The super-loops
> would go in the pipeline at the same place where the hardware TCL
> functions go. If the super-loop could do all the processing, the
> following TCL stages would be skipped.

This sounds like the 'fastpath' stages which were common in drivers
based on Mesa 3.x. We had a pipeline stage, which most drivers supplied,
tuned to handle Quake 3 CVA-style rendering operations. It was pretty
fast, but in the end not much faster than standard Mesa 4.x operation.

The fallback hardware TCL processing in the radeon drivers is installed
as a pipeline stage also.

Keith
|
From: Keith W. <ke...@tu...> - 2003-04-05 06:47:46
|
Ian Romanick wrote:
> Before going down that road we'd want to sit down with oprofile and a
> bunch of applications to decide which sets of state we wanted to tune
> for. IMHO, we'd be better to spend our time writing a highly optimized
> just-in-time compiler for ARB_vertex_program. Then we could just write
> vertex programs for the different "important" state vectors and let the
> compiler generate the super-loop. Of course, there are still "issues"
> with vertex programs. :(

Oprofile doesn't do a good job on runtime-generated code yet. I guess
they're getting a bit more stabilized now, so it might be time to bring
up the idea/issue/problem with them... I'm not sure what solution there
could be, especially as oprofile isn't really tied to a single run of a
program.

Keith
|
From: Keith W. <ke...@tu...> - 2003-04-05 06:21:22
|
> The thing I found more interesting is the issue of applying the TCL
> operations to all the vertices at once, or to one vertex at a time.
> From previous discussions on this list it seems that nowadays most CPU
> performance is dictated by the cache, so it really seems the latter
> option is more efficient, but Mesa implements the former (the stages
> are even called "pipeline stages") and changing that would mean a big
> overhaul of the TnL module.

Doing it in arrays is better from an instruction-cache point of view,
and as long as the arrays are small enough to fit in cache, there's no
penalty from a data-cache point of view. That's the point of, e.g., the
code in t_array_api.c which cuts large arrays up into 256-vertex chunks
for processing by the tnl pipeline.

Keith
|
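[Keith's chunking argument -- run every pipeline stage over a cache-sized batch before moving to the next batch -- can be sketched as below. The chunk size matches the 256 mentioned above; the two stage functions are trivial stand-ins, not Mesa's real t_array_api.c stages.]

```cpp
#include <algorithm>
#include <cstddef>

struct Vertex { float x, y, z; };

// Stand-in pipeline stages (assumptions, not Mesa's real ones):
// stage 1 scales x, stage 2 offsets x, each over a whole chunk.
static void transform(Vertex *v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) v[i].x *= 2.0f;
}
static void light(Vertex *v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) v[i].x += 1.0f;
}

// Run all stages over one chunk before advancing, so the chunk's data
// is still hot in the data cache when the second stage touches it --
// while each stage's loop body stays small for the instruction cache.
void run_pipeline(Vertex *verts, std::size_t count) {
    const std::size_t kChunk = 256;
    for (std::size_t i = 0; i < count; i += kChunk) {
        const std::size_t n = std::min(kChunk, count - i);
        transform(verts + i, n);
        light(verts + i, n);
    }
}
```

This is the middle ground between "one vertex at a time" (good data locality, poor instruction locality) and "the whole array per stage" (the reverse, once the array outgrows the cache).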
From: Keith W. <ke...@tu...> - 2003-04-04 21:26:22
|
José Fonseca wrote:
> On Fri, Apr 04, 2003 at 08:48:35AM -0700, Brian Paul wrote:
>
>> In general, this sounds reasonable but you also have to consider
>> performance. The glVertex, Color, TexCoord, etc. commands have to be
>> simple and fast. As it is now, glColor4f (for example) (when
>> implemented in x86 assembly) is just a jump into _tnl_Color4f() which
>> stuffs the color into the immediate struct and returns. Something
>> similar is done in the R200 driver.
>>
>> If the implementation of _tnl_Color4f() involves a call to
>> producer->Color4f() we'd lose some performance.
>
> I know, but my objective is to design a good object interface into
> which all drivers may fit and reuse code. When a driver gets to the
> point where the producer->Color4f() calls are the main performance
> bottleneck (!?) the developer is free to write a tailored version of
> TnLProducer that eliminates that extra call:
>
>     class TnLProducerFast {
>
>         Vertex current;
>         TnLConsumer *consumer;
>
>         TnLProducer(TnLConsumer *_consumer) {
>             consumer = _consumer;
>         }
>
>         void activate() {
>             _glapi_setapi(GL_COLOR3f, _Color3f)
>             ...
>         }
>
>         static _Color3f(r, g, b) {
>             TnLProducer *self = GET_THIS_PTR_FROM_CURRENT_CTX();
>             self->current.r = r; self->current.g = g; self->current.b = b;
>         }
>     };
>
> We can even generate this TnLProducerFast automatically from the
> original TnLProducer with a template, i.e.,
>
>     template < class T >
>     class TnLProducerTmpl {
>
>         T tnl;
>
>         void activate() {
>             _glapi_setapi(GL_COLOR3f, _Color3f)
>             ...
>         }
>
>         static _Color3f(r, g, b) {
>             TnLProducerTmpl *self = GET_THIS_PTR_FROM_CURRENT_CTX();
>             self->tnl.Color3f(r, g, b); // This call is eliminated if
>                                         // T::Color3f is inlined
>         }
>     };
>
>     typedef TnLProducerTmpl< TnLProducer > TnLProducerFast;
>
> But this is all of _very_ _little_ importance compared to the ability
> to _write_ a full driver fast, which is given by a well-designed OOP
> interface. As I have said here several times, this kind of low-level
> optimization consumes too much development time, with the result that
> higher-level optimizations (usually with much more impact on
> performance) are never attempted.

The optimization of the vertex API has yielded huge improvements. Even
with the runtime-code-generated versions of these functions in the
radeon/r200 drivers, they *still* dominate viewperf profile runs --
meaning that *all other optimizations* are a waste of time for viewperf,
because 60% of your time is being spent in the vertex API functions.

>> Nowadays, vertex arrays are the path to use if you really care about
>> performance, of course, but a lot of apps still use the regular
>> per-vertex GL functions.

Except for applications that already exist and use the vertex APIs -- of
which there are many.

And vertex arrays aren't the fastpath any more, but things like
ARB_vertex_array_object or NV_vertex_array_range.

> Now that you mention vertex arrays: for those, the producer would be
> different, but the consumer would be the same.

For developing a driver, it's not necessary to touch the tnl code at
all -- even hardware T&L drivers can quite happily plug into the
existing mechanisms and get OK performance.

Keith
|
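[José's TnLProducerTmpl pseudocode above can be fleshed out into compilable form. The dispatch-table machinery (_glapi_setapi, GET_THIS_PTR_FROM_CURRENT_CTX) is replaced with a simple static "current context" pointer -- a stand-in, not Mesa's real glapi -- so the only point demonstrated is that the inner tnl.Color3f call inlines away, removing the extra indirect call Brian worried about.]

```cpp
struct Vertex { float r, g, b, x, y, z; };

struct TnLConsumer {
    Vertex last{};
    int count = 0;
    // A real consumer would write to a DMA buffer or display list.
    void consume(const Vertex &v) { last = v; ++count; }
};

// The plain producer: ordinary inline member functions.
struct TnLProducer {
    Vertex current{};
    TnLConsumer *consumer;
    explicit TnLProducer(TnLConsumer *c) : consumer(c) {}
    void Color3f(float r, float g, float b) {
        current.r = r; current.g = g; current.b = b;
    }
    void Coord3f(float x, float y, float z) {
        current.x = x; current.y = y; current.z = z;
        consumer->consume(current); // produce() on each complete vertex
    }
};

// Template wrapper generating dispatch-table-ready statics from any
// producer type T. current_ctx stands in for fetching "this" from the
// current GL context.
template <class T>
struct TnLProducerTmpl {
    T tnl;
    static TnLProducerTmpl *current_ctx;
    explicit TnLProducerTmpl(TnLConsumer *c) : tnl(c) {}
    // These statics are what would be installed via _glapi_setapi();
    // the inner member call is inlined away by the compiler.
    static void _Color3f(float r, float g, float b) {
        current_ctx->tnl.Color3f(r, g, b);
    }
    static void _Coord3f(float x, float y, float z) {
        current_ctx->tnl.Coord3f(x, y, z);
    }
};
template <class T>
TnLProducerTmpl<T> *TnLProducerTmpl<T>::current_ctx = nullptr;

using TnLProducerFast = TnLProducerTmpl<TnLProducer>;
```

Usage mirrors the dispatch path: activate a context, then call the statics as the GL entry points would.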
From: F. <jrf...@tu...> - 2003-04-05 00:14:18
|
On Fri, Apr 04, 2003 at 10:23:21PM +0100, Keith Whitwell wrote:
>
> The optimization of the vertex API has yielded huge improvements. Even
> with the runtime-code-generated versions of these functions in the
> radeon/r200 drivers, they *still* dominate viewperf profile runs --
> meaning that *all other optimizations* are a waste of time for
> viewperf, because 60% of your time is being spent in the vertex API
> functions.

I was underestimating its importance then...

> >> Nowadays, vertex arrays are the path to use if you really care about
> >> performance, of course, but a lot of apps still use the regular
> >> per-vertex GL functions.
>
> Except for applications that already exist and use the vertex APIs --
> of which there are many.
>
> And vertex arrays aren't the fastpath any more, but things like
> ARB_vertex_array_object or NV_vertex_array_range.
>
> > Now that you mention vertex arrays: for those, the producer would be
> > different, but the consumer would be the same.
>
> For developing a driver, it's not necessary to touch the tnl code at
> all -- even hardware T&L drivers can quite happily plug into the
> existing mechanisms and get OK performance.

For now I'll also be plugging the C++ classes into the existing T&L
code, but in the future I may want to change the way the software T&L
interfaces with the [hardware] rasterizers, since the current interface
makes it difficult to reuse vertices when outputting tri- or
quad-strips, i.e., you keep sending the same vertices over and over
again, even when they are shared between consecutive triangles.

José Fonseca
|
From: Keith W. <ke...@tu...> - 2003-04-05 06:40:10
|
José Fonseca wrote:
> On Fri, Apr 04, 2003 at 10:23:21PM +0100, Keith Whitwell wrote:
>
>> The optimization of the vertex API has yielded huge improvements. Even
>> with the runtime-code-generated versions of these functions in the
>> radeon/r200 drivers, they *still* dominate viewperf profile runs --
>> meaning that *all other optimizations* are a waste of time for
>> viewperf, because 60% of your time is being spent in the vertex API
>> functions.
>
> I was underestimating its importance then...
>
>>>> Nowadays, vertex arrays are the path to use if you really care about
>>>> performance, of course, but a lot of apps still use the regular
>>>> per-vertex GL functions.
>>
>> Except for applications that already exist and use the vertex APIs --
>> of which there are many.
>>
>> And vertex arrays aren't the fastpath any more, but things like
>> ARB_vertex_array_object or NV_vertex_array_range.
>>
>>> Now that you mention vertex arrays: for those, the producer would be
>>> different, but the consumer would be the same.
>>
>> For developing a driver, it's not necessary to touch the tnl code at
>> all -- even hardware T&L drivers can quite happily plug into the
>> existing mechanisms and get OK performance.
>
> For now I'll also be plugging the C++ classes into the existing T&L
> code, but in the future I may want to change the way the software T&L
> interfaces with the [hardware] rasterizers, since the current interface
> makes it difficult to reuse vertices when outputting tri- or
> quad-strips, i.e., you keep sending the same vertices over and over
> again, even when they are shared between consecutive triangles.

Honestly, that's just not true. You can hook these routines out in a
bunch of different ways.

Look at t_context.h -- the tnl_device_driver struct. In here there are
two tables of function pointers, 'PrimTabVerts' and 'PrimTabElts', which
hand off whole transformed, clipped primitives -- tristrips, quadstrips,
polygons, etc. -- to the driver. Hook these functions out and you can
send the vertices to hardware however you like.

Also look at tnl_dd/t_dd_dmatmp.h, as used in mga/mgarender.c and
elsewhere -- this is an even more direct route to hardware and is very
useful if you have a card that understands tristrips, etc. It probably
isn't much use for a mach64, though.

Clipped triangles are more difficult to handle; we currently call
tnl->Driver.Render.PrimTabElts[GL_POLYGON] for each clipped primitive.
If you can think of a better solution & code it up, I'd be interested to
see it. It would be interesting to see some other approaches, but I
think this one actually ends up not being too bad.

Keith
|
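[The shape of the primitive-table hook Keith describes can be sketched generically. The enum values, struct names, and signatures below are assumptions for illustration -- not Mesa's actual tnl_device_driver declarations -- but they show why hooking whole-primitive functions lets a strip-capable card avoid resending shared vertices, which was José's original complaint.]

```cpp
#include <cstddef>

// Illustrative subset of primitive types, indexed GL-style.
enum Prim { PRIM_TRIANGLES, PRIM_TRI_STRIP, PRIM_QUAD_STRIP,
            PRIM_POLYGON, PRIM_COUNT };

struct Vertex { float x, y, z; };

// A driver-fillable table in the spirit of PrimTabVerts: the tnl module
// hands a whole transformed, clipped primitive to whichever hook the
// driver installed for that primitive type.
using PrimFunc = void (*)(const Vertex *verts, std::size_t count);

struct DeviceDriver {
    PrimFunc prim_tab[PRIM_COUNT];
};

// Count of vertex DWords "sent to hardware", for comparison.
static std::size_t g_emitted = 0;

// Hook for a card that understands tristrips natively: each strip
// vertex is emitted exactly once, shared vertices are reused on-chip.
static void hw_tri_strip(const Vertex *, std::size_t count) {
    g_emitted += count;
}

// Fallback hook that expands the strip into independent triangles,
// resending the two shared vertices of every interior triangle.
static void sw_tri_strip(const Vertex *, std::size_t count) {
    if (count >= 3) g_emitted += (count - 2) * 3;
}
```

For a 10-vertex strip the native hook emits 10 vertices while the expanding fallback emits 24 -- the redundancy the hook mechanism lets a driver eliminate.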
From: F. <jrf...@tu...> - 2003-04-05 09:53:30
|
On Sat, Apr 05, 2003 at 07:37:13AM +0100, Keith Whitwell wrote:
> José Fonseca wrote:
> > For now I'll also be plugging the C++ classes into the existing T&L
> > code, but in the future I may want to change the way the software
> > T&L interfaces with the [hardware] rasterizers, since the current
> > interface makes it difficult to reuse vertices when outputting tri-
> > or quad-strips, i.e., you keep sending the same vertices over and
> > over again, even when they are shared between consecutive triangles.
>
> Honestly, that's just not true. You can hook these routines out in a
> bunch of different ways.
>
> Look at t_context.h -- the tnl_device_driver struct. In here there are
> two tables of function pointers, 'PrimTabVerts' and 'PrimTabElts',
> which hand off whole transformed, clipped primitives -- tristrips,
> quadstrips, polygons, etc. -- to the driver. Hook these functions out
> and you can send the vertices to hardware however you like.

I thought the interface exposed by t_dd_tritmp.h was the only kind of
interface exposed by TnL. I really have to study the TnL module a little
more (perhaps writing some quick documentation too).

> Also look at tnl_dd/t_dd_dmatmp.h, as used in mga/mgarender.c and
> elsewhere -- this is an even more direct route to hardware and is very
> useful if you have a card that understands tristrips, etc. It probably
> isn't much use for a mach64, though.

Quite the opposite: the Mach64's triangle setup engine is designed in
such a way that it allows you to update vertices selectively (i.e., for
each vertex in the DMA buffer you specify which of the 3 vertices of the
triangle setup engine you're updating), allowing it to cope with
triangle strips of any kind. So it may be useful at some point to switch
to t_dd_dmatmp.h instead of t_dd_tritmp.h in the Mach64 driver.

> Clipped triangles are more difficult to handle; we currently call
> tnl->Driver.Render.PrimTabElts[GL_POLYGON] for each clipped primitive.
> If you can think of a better solution & code it up, I'd be interested
> to see it. It would be interesting to see some other approaches, but I
> think this one actually ends up not being too bad.

For the rasterization I was also considering using a two-stage
producer-consumer scheme: A->B->C. Producer A is chosen according to the
vertex format, while C is always the same and only emits the vertex (or
vertices) received to the hardware. Vertices don't necessarily have to
be copied across the stages; a reference can be used whenever possible.
For dealing with clipped vertices, when passing them to B there would be
a "clipped" flag associated with every vertex, and they would all follow
some ordering convention. B could be implemented to just call the
polygon version of B, or be smarter and reuse some of the vertices
already downstream.

José Fonseca
|
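[José's A->B->C scheme can be sketched as three chained stages. The class names, the per-vertex clipped flag, and the stage responsibilities are taken from his description; everything else (the logging consumer, the pass-through strip stage) is invented scaffolding for the example.]

```cpp
#include <vector>

struct Vertex {
    float x, y, z;
    bool clipped; // set by the clipper; consumers follow an agreed ordering
};

// Stage C: always the same -- emits vertices to "hardware" (logged here).
struct EmitStage {
    std::vector<Vertex> emitted;
    void consume(const Vertex &v) { emitted.push_back(v); }
};

// Stage B: primitive assembly. This naive version forwards everything;
// a smarter B could notice vertices already downstream and reuse them,
// or fall back to polygon handling when it sees clipped vertices.
struct StripStage {
    EmitStage *next;
    void consume(const Vertex &v) {
        // (clipped-vertex fallback would go here)
        next->consume(v); // passes a reference -- no copy across stages
    }
};

// Stage A: chosen per vertex format; builds vertices and feeds B.
struct Producer {
    StripStage *next;
    void vertex(float x, float y, float z, bool clipped = false) {
        next->consume(Vertex{x, y, z, clipped});
    }
};
```

Wiring is just pointer assembly: `Producer a{&b}; StripStage b{&c}; EmitStage c;` (declared bottom-up), after which each `a.vertex(...)` call flows through all three stages.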
From: Ian R. <id...@us...> - 2003-04-07 15:37:36
|
José Fonseca wrote:
> On Sat, Apr 05, 2003 at 07:37:13AM +0100, Keith Whitwell wrote:
>> Also look at tnl_dd/t_dd_dmatmp.h, as used in mga/mgarender.c and
>> elsewhere -- this is an even more direct route to hardware and is
>> very useful if you have a card that understands tristrips, etc. It
>> probably isn't much use for a mach64, though.
>
> Quite the opposite: the Mach64's triangle setup engine is designed in
> such a way that it allows you to update vertices selectively (i.e., for
> each vertex in the DMA buffer you specify which of the 3 vertices of
> the triangle setup engine you're updating), allowing it to cope with
> triangle strips of any kind. So it may be useful at some point to
> switch to t_dd_dmatmp.h instead of t_dd_tritmp.h in the Mach64 driver.

Just as a random side note, this sounds like the way some Sun hardware
works.

http://oss.sgi.com/projects/ogl-sample/registry/SUN/triangle_list.txt

Does any of the other hardware supported by open-source drivers work
this way? Just curious...
|
From: Keith W. <ke...@tu...> - 2003-04-07 15:48:11
|
Ian Romanick wrote:
> José Fonseca wrote:
>
>> On Sat, Apr 05, 2003 at 07:37:13AM +0100, Keith Whitwell wrote:
>>
>>> Also look at tnl_dd/t_dd_dmatmp.h, as used in mga/mgarender.c and
>>> elsewhere -- this is an even more direct route to hardware and is
>>> very useful if you have a card that understands tristrips, etc. It
>>> probably isn't much use for a mach64, though.
>>
>> Quite the opposite: the Mach64's triangle setup engine is designed in
>> such a way that it allows you to update vertices selectively (i.e.,
>> for each vertex in the DMA buffer you specify which of the 3 vertices
>> of the triangle setup engine you're updating), allowing it to cope
>> with triangle strips of any kind. So it may be useful at some point to
>> switch to t_dd_dmatmp.h instead of t_dd_tritmp.h in the Mach64 driver.
>
> Just as a random side note, this sounds like the way some Sun hardware
> works.
>
> http://oss.sgi.com/projects/ogl-sample/registry/SUN/triangle_list.txt
>
> Does any of the other hardware supported by open-source drivers work
> this way? Just curious...

Most of them just export the Direct3D primitive types, plus perhaps the
additional GL ones. I think the Mach64 might support this, and maybe the
r128 too.

Keith
|
From: F. <jrf...@tu...> - 2003-04-08 13:19:03
|
On Mon, Apr 07, 2003 at 08:37:10AM -0700, Ian Romanick wrote:
> José Fonseca wrote:
> > On Sat, Apr 05, 2003 at 07:37:13AM +0100, Keith Whitwell wrote:
> >> Also look at tnl_dd/t_dd_dmatmp.h, as used in mga/mgarender.c and
> >> elsewhere -- this is an even more direct route to hardware and is
> >> very useful if you have a card that understands tristrips, etc. It
> >> probably isn't much use for a mach64, though.
> >
> > Quite the opposite: the Mach64's triangle setup engine is designed in
> > such a way that it allows you to update vertices selectively (i.e.,
> > for each vertex in the DMA buffer you specify which of the 3 vertices
> > of the triangle setup engine you're updating), allowing it to cope
> > with triangle strips of any kind. So it may be useful at some point
> > to switch to t_dd_dmatmp.h instead of t_dd_tritmp.h in the Mach64
> > driver.
>
> Just as a random side note, this sounds like the way some Sun hardware
> works.
>
> http://oss.sgi.com/projects/ogl-sample/registry/SUN/triangle_list.txt

Thanks for the pointer. I think I'm going to use some of the ideas here
for the C++ rasterizer interface, to model the possible triangle setup
engines in a hardware-independent fashion.

As a curiosity, the reason the Mach64 accepts this triangle strip model
is that the DMA buffers are basically register-value pairs, so one can
choose the triangle vertex number to update by choosing the respective
register, e.g. (one DWORD per line):

    MACH64_VERTEX_1_X_Y 1   // 1 + 1 = 2 value DWORDS follow
    (x1 << 16) | y1
    z1
    MACH64_VERTEX_2_X_Y 1
    (x2 << 16) | y2
    z2
    MACH64_VERTEX_3_X_Y 1
    (x3 << 16) | y3
    z3
    MACH64_VERTEX_2_X_Y 1
    (x2 << 16) | y2
    z2
    ...

José Fonseca
|
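[The register-value encoding José describes can be sketched as a small DMA-buffer builder. The register constants below are placeholders, not the real Mach64 register offsets; the point is only the encoding shape: "register, count" followed by the value DWORDs, where rewriting a single vertex-slot register updates one vertex of the setup engine while the other two are reused.]

```cpp
#include <cstdint>
#include <vector>

// Hypothetical indices for the three setup-engine vertex slots
// (the real Mach64 register values differ).
enum : std::uint32_t {
    MACH64_VERTEX_1_X_Y = 0x100,
    MACH64_VERTEX_2_X_Y = 0x104,
    MACH64_VERTEX_3_X_Y = 0x108,
};

// Append one vertex update: register, count, then count + 1 value
// DWORDs -- x/y packed into one DWORD plus z, so count is 1 here.
static void emit_vertex(std::vector<std::uint32_t> &dma, std::uint32_t reg,
                        std::uint16_t x, std::uint16_t y, std::uint32_t z) {
    dma.push_back(reg);
    dma.push_back(1); // 1 + 1 = 2 value DWORDS follow
    dma.push_back((std::uint32_t(x) << 16) | y);
    dma.push_back(z);
}
```

For a strip, the first triangle fills all three slots; each subsequent triangle rewrites only the one slot whose vertex changed, so shared strip vertices never cross the bus twice.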