From: F. <j_r...@ya...> - 2003-04-04 01:38:09
Now that, thanks to Brian, the textures are pretty much taken care of, I'm moving into the TNL module for the C++ framework.

First, some definitions. "TnL" here is defined as the object [or module] that handles all the geometric (vertex) data (as opposed to the context, which handles the state). This data is supposed to be transformed, clipped, lit and rasterized, but not all of these tasks are performed by the TnL object itself - actually they are dispatched to the hardware as much and as soon as possible.

As a special note, the TnL receives vertices but, since usually many of the vertex properties (color, normals, ...) don't change from one vertex to the next, in OpenGL you have one API entry point for each property (glVertex, glColor, etc.). Still, it's _whole_ vertices that it's receiving.

My proposal is to model the TnL module as a producer-consumer. The producer exposes an API similar to OpenGL's, updates a "current vertex", and produces vertices from that current vertex. The consumer receives those vertices. I.e., something like this:

    class Vertex; // Abstract vertex

    class TnLConsumer {
        void consume(Vertex *vertex);
    };

    class TnLProducer {
        Vertex current;
        TnLConsumer *consumer;

        TnLProducer(TnLConsumer *_consumer) {
            consumer = _consumer;
        }

        void Color3(float r, float g, float b) {
            current.r = r; current.g = g; current.b = b;
        }

        void Coord3(float x, float y, float z) {
            current.x = x; current.y = y; current.z = z;
            produce();
        }

        void produce() {
            consumer->consume(&current);
        }
    };

What's special about this is that usually there isn't just a single producer for a given driver, but potentially a myriad of them (each specialized for a set of hardware vertex formats or a software vertex format). The same goes for the consumer. The appropriate producer and consumer are chosen by the context during a glBegin() call.

The reason to separate the consumer and producer, rather than merge them, is that when using call lists the producer/consumer won't be sending the vertices to the card but to memory instead. This is accomplished by using another consumer/producer which writes/reads the hardware vertices to/from memory.

This can be implemented in C++ without touching the current Mesa code, by wrapping the current TnL code. But if the idea is pleasing we could move the C TnL interface to this model. This would allow code for direct hardware vertex generation (as is done in the Radeon embedded driver) to coexist nicely with code that needs to do some of the TCL operations in software.

José Fonseca
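A minimal compilable sketch of the proposal above (the attribute types, access specifiers, and the memory-writing consumer used for the call-list case are all assumptions; the original is only pseudocode):

```cpp
#include <cassert>
#include <vector>

// Illustrative names; not actual Mesa code.
struct Vertex {
    float x, y, z;
    float r, g, b;
};

class TnLConsumer {
public:
    virtual ~TnLConsumer() {}
    virtual void consume(const Vertex *v) = 0;
};

// A consumer that stores vertices in memory, as a call-list consumer
// would, instead of sending them to the card.
class MemoryConsumer : public TnLConsumer {
public:
    std::vector<Vertex> stored;
    void consume(const Vertex *v) { stored.push_back(*v); }
};

class TnLProducer {
    Vertex current;
    TnLConsumer *consumer;
public:
    explicit TnLProducer(TnLConsumer *c) : current(), consumer(c) {}
    // Property calls update the current vertex; Coord3 completes it.
    void Color3(float r, float g, float b) {
        current.r = r; current.g = g; current.b = b;
    }
    void Coord3(float x, float y, float z) {
        current.x = x; current.y = y; current.z = z;
        produce();
    }
private:
    void produce() { consumer->consume(&current); }
};
```

Swapping in a different TnLConsumer is all it takes to redirect vertices from the card to memory, which is the point of keeping producer and consumer separate.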
From: Brian P. <br...@tu...> - 2003-04-04 15:53:43
José Fonseca wrote:
> Now that, thanks to Brian, the textures are pretty much taken care of,
> I'm moving into the TNL module for the C++ framework.

[sample code cut]

> This can be implemented in C++ without touching the current Mesa code, by
> wrapping the current TnL code. But if the idea is pleasing we could
> move the C TnL interface to this model. This would allow code for direct
> hardware vertex generation (as is done in the Radeon embedded driver) to
> coexist nicely with code that needs to do some of the TCL operations in
> software.

In general, this sounds reasonable but you also have to consider performance.

The glVertex, Color, TexCoord, etc. commands have to be simple and fast. As it is now, glColor4f (for example) (when implemented in x86 assembly) is just a jump into _tnl_Color4f() which stuffs the color into the immediate struct and returns. Something similar is done in the R200 driver.

If the implementation of _tnl_Color4f() involves a call to producer->Color4f() we'd lose some performance.

Nowadays, vertex arrays are the path to use if you really care about performance, of course, but a lot of apps still use the regular per-vertex GL functions.

-Brian
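The fast path Brian describes can be sketched like this: the dispatch-table slot for glColor4f points at a plain function that stuffs the color into a current-vertex struct and returns, with no producer object in between. The names below are illustrative, not the actual Mesa symbols:

```cpp
#include <cassert>

// The "immediate struct" the color is stuffed into (illustrative).
struct CurrentVertex {
    float r, g, b, a;
} current;

// What the dispatch slot jumps to: one store per component, then return.
static void fast_Color4f(float r, float g, float b, float a) {
    current.r = r; current.g = g; current.b = b; current.a = a;
}

// The dispatch-table slot itself; the API entry point (or an assembly
// stub) is just an indirect jump through this pointer.
typedef void (*Color4fFunc)(float, float, float, float);
Color4fFunc dispatch_Color4f = fast_Color4f;
```

Adding a producer->Color4f() virtual call on top of this path is the extra cost being discussed.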
From: F. <jrf...@tu...> - 2003-04-04 17:20:46
On Fri, Apr 04, 2003 at 08:48:35AM -0700, Brian Paul wrote:
> The glVertex, Color, TexCoord, etc. commands have to be simple and fast. As
> it is now, glColor4f (for example) (when implemented in x86 assembly) is
> just a jump into _tnl_Color4f() which stuffs the color into the immediate
> struct and returns. Something similar is done in the R200 driver.
>
> If the implementation of _tnl_Color4f() involves a call to
> producer->Color4f() we'd lose some performance.

I know, but my objective is to design a good object interface on which all drivers may fit and reuse code. When a driver gets to the point where the producer->Color4f() calls are the main performance bottleneck (!?) the developer is free to write a tailored version of TnLProducer that eliminates that extra call:

    class TnLProducerFast {
        Vertex current;
        TnLConsumer *consumer;

        TnLProducerFast(TnLConsumer *_consumer) {
            consumer = _consumer;
        }

        void activate() {
            _glapi_setapi(GL_COLOR3f, _Color3f);
            ...
        }

        static void _Color3f(float r, float g, float b) {
            TnLProducerFast *self = GET_THIS_PTR_FROM_CURRENT_CTX();
            self->current.r = r;
            self->current.g = g;
            self->current.b = b;
        }
    };

We can even generate this TnLProducerFast automatically from the original TnLProducer with a template, i.e.,

    template <class T>
    class TnLProducerTmpl {
        T tnl;

        void activate() {
            _glapi_setapi(GL_COLOR3f, _Color3f);
            ...
        }

        static void _Color3f(float r, float g, float b) {
            TnLProducerTmpl *self = GET_THIS_PTR_FROM_CURRENT_CTX();
            self->tnl.Color3f(r, g, b); // This call is eliminated if
                                        // T::Color3f is inlined
        }
    };

    typedef TnLProducerTmpl< TnLProducer > TnLProducerFast;

But this is all of _very_ _little_ importance compared with the ability to _write_ a full driver fast, which is what a well designed OOP interface gives you. As I said here several times, this kind of low-level optimization consumes too much development time, with the result that higher-level optimizations (usually with much more impact on performance) are never attempted.

> Nowadays, vertex arrays are the path to use if you really care about
> performance, of course, but a lot of apps still use the regular
> per-vertex GL functions.

Now that you mention vertex arrays: for those the producer would be different, but the consumer would be the same.

José Fonseca
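A compilable version of the wrapper-template idea above, under stated assumptions: a global current-context pointer stands in for the GET_THIS_PTR_FROM_CURRENT_CTX() macro, and SimpleProducer is an invented stand-in for TnLProducer:

```cpp
#include <cassert>

// Invented stand-in for the wrapped producer; its inline Color3f is
// what the compiler can fold into the static entry point below.
struct SimpleProducer {
    float r, g, b;
    void Color3f(float r_, float g_, float b_) { r = r_; g = g_; b = b_; }
};

template <class T>
class TnLProducerTmpl {
public:
    T tnl;
    static TnLProducerTmpl *current_ctx;   // set by activate()

    void activate() { current_ctx = this; }

    // The function actually placed in the dispatch table; the forwarding
    // call disappears when T::Color3f is inlined.
    static void _Color3f(float r, float g, float b) {
        current_ctx->tnl.Color3f(r, g, b);
    }
};

template <class T>
TnLProducerTmpl<T> *TnLProducerTmpl<T>::current_ctx = 0;

typedef TnLProducerTmpl<SimpleProducer> TnLProducerFast;
```

The static member function has a plain function type, so it can go straight into a C dispatch table, which is what makes this pattern fit the existing glapi machinery.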
From: Ian R. <id...@us...> - 2003-04-04 18:08:46
José Fonseca wrote:
> On Fri, Apr 04, 2003 at 08:48:35AM -0700, Brian Paul wrote:
>> If the implementation of _tnl_Color4f() involves a call to
>> producer->Color4f() we'd lose some performance.
>
> I know, but my objective is to design a good object interface on which
> all drivers may fit and reuse code. When a driver gets to the point
> where the producer->Color4f() calls are the main performance bottleneck
> (!?) the developer is free to write a tailored version of TnLProducer
> that eliminates that extra call:

Right now people use things like Viewperf to make systems purchase decisions. Unless the graphics hardware and the rest of the system are very mismatched, the immediate API already has an impact on performance in those benchmarks.

The performance of the immediate API *is* important to real applications. Why do you think Sun came up with the SUN_vertex extension? To reduce the overhead of the immediate API, of course. :)

[sample code cut]

> But this is all of _very_ _little_ importance compared with the ability
> to _write_ a full driver fast, which is what a well designed OOP
> interface gives you. As I said here several times, this kind of
> low-level optimization consumes too much development time, with the
> result that higher-level optimizations (usually with much more impact
> on performance) are never attempted.

In principle, I think the producer/consumer idea is good. Why not implement known optimizations in it from the start? We already have *working code* to build formatted vertex data (see the radeon & r200 drivers), why not build the object model from there? Each concrete producer class would have an associated vertex format. On creation, it would fill in a table of functions to put data in its vertex buffer. This could mean pointers to generic C functions, or it could mean dynamically generating code from assembly stubs.

The idea is that the functions from this table could be put directly in the dispatch table. This is, IMHO, critically important.

The various vertex functions then just need to call the object's produce method. This all boils down to putting a C++ face on a technique that has been demonstrated to work.

I do have one question. Do we really want to invoke the producer on every vertex immediately? In the radeon / r200 drivers this is just to copy the whole vertex to a DMA buffer. Why not generate the data directly where it needs to go? I know that if the vertex format changes before the vertex is complete we need to copy out of the temporary buffer into the GL state vector, but that doesn't seem like the common case. At the very least, some guys at Intel think generating data directly in DMA buffers is the way to go:

http://www.intel.com/technology/itj/Q21999/ARTICLES/art_4.htm

I guess my point is that we *can* have our cake and eat it too. We can have a nice object model and have "classic" low-level optimizations. The benefit of doing those optimizations at the level of the object model is that they only need to be done once for a given vertex format. Reusing optimizations sounds like a big win to me! :)
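The per-format producer suggested above can be sketched roughly as follows; the particular format (offsets, stride) and the std::vector standing in for a DMA buffer are invented for illustration, not taken from the radeon/r200 code:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// A producer bound to one hardware vertex format: it knows where each
// attribute lives in the formatted vertex and emits completed vertices
// straight into a (simulated) DMA buffer.
struct FormatProducer {
    std::vector<unsigned char> dma;     // stands in for a DMA buffer
    size_t stride;
    size_t coord_offset, color_offset;
    unsigned char scratch[32];          // the vertex being built

    FormatProducer(size_t coord_off, size_t color_off, size_t s)
        : stride(s), coord_offset(coord_off), color_offset(color_off) {
        std::memset(scratch, 0, sizeof scratch);
    }

    void Color3ub(unsigned char r, unsigned char g, unsigned char b) {
        unsigned char c[3] = { r, g, b };
        std::memcpy(scratch + color_offset, c, sizeof c);
    }
    void Coord3(float x, float y, float z) {
        float v[3] = { x, y, z };
        std::memcpy(scratch + coord_offset, v, sizeof v);
        produce();
    }
    // Append the formatted vertex; real code would write at a cursor
    // into mapped DMA memory rather than grow a vector.
    void produce() {
        dma.insert(dma.end(), scratch, scratch + stride);
    }
};
```

The per-attribute writers here are plain member functions; the table-of-functions idea would populate the dispatch table with exactly this kind of offset-aware writer, one set per vertex format.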
From: F. <jrf...@tu...> - 2003-04-05 00:04:16
On Fri, Apr 04, 2003 at 10:08:36AM -0800, Ian Romanick wrote:
> In principle, I think the producer/consumer idea is good. Why not
> implement known optimizations in it from the start? [...]
>
> The various vertex functions then just need to call the object's produce
> method. This all boils down to putting a C++ face on a technique that
> has been demonstrated to work.

I hope that integration of assembly generation with C++ is feasible, but I see it as an implementation issue, regardless of the performance issues, which according to all who have replied aren't as negligible as I thought. The reason is that this kind of optimization is very dependent on the vertex formats and other hardware details, making it difficult to reuse the code - which is exactly what I want to avoid at this stage.

> I do have one question. Do we really want to invoke the producer on
> every vertex immediately? [...] At the very least, some guys at Intel
> think generating data directly in DMA buffers is the way to go:
>
> http://www.intel.com/technology/itj/Q21999/ARTICLES/art_4.htm

This is a very interesting read. Thanks for the pointer.

It's complicated to know the vertices' positions in the DMA buffer from the beginning, especially because of clipping, since vertices can be added or removed, but if I understood correctly, it's still better to work in the DMA memory and move the vertices around to avoid cache misses. But it can be very tricky: imagine that clipping generates vertices that no longer fit in the DMA buffer - what would be done then?

The thing I found most interesting is the issue of applying the TCL operations to all the vertices at once versus one vertex at a time. From previous discussions on this list it seems that nowadays most CPU performance is dictated by the cache, so it really seems the latter option is more efficient, but Mesa implements the former (they are even called "pipeline stages") and to change would mean a big overhaul of the TnL module.

> I guess my point is that we *can* have our cake and eat it too. We can
> have a nice object model and have "classic" low-level optimizations.
> The benefit of doing those optimizations at the level of the object model
> is that they only need to be done once for a given vertex format.
> Reusing optimizations sounds like a big win to me! :)

I hope so. But at this point I'll just try to design the objects so they allow both kinds of implementation.

Thanks for the feedback.

José Fonseca
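The two organizations being compared - stage-by-stage over an array versus one loop carrying each vertex through all stages - can be sketched as follows (the transform/light placeholders are invented; real stages do far more work):

```cpp
#include <cassert>
#include <cstddef>

struct V { float x, y; float lit; };

static void xform(V &v) { v.x *= 2.0f; v.y *= 2.0f; }
static void light(V &v) { v.lit = v.x + v.y; }

// Mesa-style pipeline stages: each stage walks the whole array, so
// every vertex is touched once per stage (poor cache reuse when the
// array exceeds the cache).
void run_staged(V *vs, size_t n) {
    for (size_t i = 0; i < n; i++) xform(vs[i]);
    for (size_t i = 0; i < n; i++) light(vs[i]);
}

// Single-vertex scheme: each vertex stays hot in cache while all
// stages run on it.
void run_per_vertex(V *vs, size_t n) {
    for (size_t i = 0; i < n; i++) { xform(vs[i]); light(vs[i]); }
}
```

Both produce identical results; the difference under discussion is purely in memory-access pattern.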
From: Brian P. <br...@tu...> - 2003-04-05 00:19:55
José Fonseca wrote:
> The thing I found most interesting is the issue of applying the TCL
> operations to all the vertices at once versus one vertex at a time.
> From previous discussions on this list it seems that nowadays most
> CPU performance is dictated by the cache, so it really seems the latter
> option is more efficient, but Mesa implements the former (they are even
> called "pipeline stages") and to change would mean a big overhaul of the
> TnL module.

On a historical note, the earliest versions of Mesa processed a single vertex at a time, instead of operating on arrays of vertices, stage by stage. Going to the latter was a big speed-up at the time.

Since the T&L code is a module, one could implement the single-vertex scheme as an alternate module. It would be an interesting experiment.

-Brian
From: F. <jrf...@tu...> - 2003-04-05 00:57:27
On Fri, Apr 04, 2003 at 05:14:54PM -0700, Brian Paul wrote:
> José Fonseca wrote:
> > The thing I found most interesting is the issue of applying the TCL
> > operations to all the vertices at once versus one vertex at a time. [...]
>
> On a historical note, the earliest versions of Mesa processed a single
> vertex at a time, instead of operating on arrays of vertices, stage by
> stage. Going to the latter was a big speed-up at the time.

Yes, and the use of SIMD instructions also favors that approach. Actually, in that article they chose to process 4 vertices at a time and not just one, surely because that's the number that fits in the MM registers.

I think the fact that CPUs got so much faster while buses didn't keep pace contributed to changing the picture, making non-cached memory access look awfully slow compared with everything else.

> Since the T&L code is a module, one could implement the single-vertex
> scheme as an alternate module. It would be an interesting experiment.

Indeed.

José Fonseca
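The 4-at-a-time idea can be sketched in scalar C++ like this (real code would use SIMD intrinsics or assembly; the 4-wide step is what maps onto the registers mentioned above, and the layout is assumed to be structure-of-arrays):

```cpp
#include <cassert>
#include <cstddef>

// Structure-of-arrays vertex storage (capacity invented for the sketch).
struct SoA {
    float x[8], y[8];
    size_t count;
};

void scale_x_by_2_in_batches_of_4(SoA &v) {
    // Four independent multiplies per iteration: exactly what one
    // 4-wide SIMD instruction would perform at once.
    for (size_t i = 0; i + 4 <= v.count; i += 4) {
        v.x[i+0] *= 2.0f; v.x[i+1] *= 2.0f;
        v.x[i+2] *= 2.0f; v.x[i+3] *= 2.0f;
    }
    // Scalar tail for counts that are not a multiple of 4.
    for (size_t i = v.count & ~size_t(3); i < v.count; i++)
        v.x[i] *= 2.0f;
}
```

Batching keeps the data for each operation contiguous, which is what makes the cache (and the SIMD registers) happy.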
From: Keith W. <ke...@tu...> - 2003-04-05 06:24:54
Brian Paul wrote:
> José Fonseca wrote:
>> The thing I found most interesting is the issue of applying the TCL
>> operations to all the vertices at once versus one vertex at a time. [...]
>
> On a historical note, the earliest versions of Mesa processed a single
> vertex at a time, instead of operating on arrays of vertices, stage by
> stage. Going to the latter was a big speed-up at the time.
>
> Since the T&L code is a module, one could implement the single-vertex
> scheme as an alternate module. It would be an interesting experiment.

For very simple modes, e.g. quake/quake2 where there is basically only clipping to do, this can work very well. The 3dfx 'minigl' driver worked this way, processing a single vertex at a time and then clipping each triangle once produced.

However, for a full GL pipeline it's not such a good proposition. One difficulty is dealing with fallbacks - if anyone tries to implement this you'll see what I mean: you want to 1) throw away intermediate data for performance reasons and 2) keep it hanging around in case you need to fall back.

Keith
From: Ian R. <id...@us...> - 2003-04-05 01:13:55
José Fonseca wrote:
> On Fri, Apr 04, 2003 at 10:08:36AM -0800, Ian Romanick wrote:
>> [...]
>
> I hope that integration of assembly generation with C++ is feasible, but
> I see it as an implementation issue [...]. The reason is that this kind
> of optimization is very dependent on the vertex formats and other
> hardware details, making it difficult to reuse the code - which is
> exactly what I want to avoid at this stage.

Realistically, either hardware or software uses either array-of-structures or structure-of-arrays. Most hardware uses the former. At that point it becomes a matter of: for a given state vector, what's the offset of an element in the structure? The assembly code in the radeon & r200 drivers handles this very nicely.

> It's complicated to know the vertices' positions in the DMA buffer from
> the beginning, especially because of clipping, since vertices can be
> added or removed [...]. But it can be very tricky: imagine that clipping
> generates vertices that no longer fit in the DMA buffer - what would be
> done then?

I think the "online driver model" from the paper only works if you have a single loop that does all the processing. Since Mesa uses a pipeline, it would be very tricky. Using the "online driver model" for a card w/HW TCL would be a different story.

> The thing I found most interesting is the issue of applying the TCL
> operations to all the vertices at once versus one vertex at a time. [...]
> Mesa implements the former (they are even called "pipeline stages") and
> to change would mean a big overhaul of the TnL module.

This would be very, very, very tricky. We'd basically need several different super-loops depending on the GL state vector. The super-loops would go in the pipeline at the same place where the hardware TCL functions go. If the super-loop could do all the processing, the following TCL stages would be skipped.

Before going down that road we'd want to sit down with oprofile and a bunch of applications to decide which sets of state we wanted to tune for. IMHO, we'd be better off spending our time writing a highly optimized just-in-time compiler for ARB_vertex_program. Then we could just write vertex programs for the different "important" state vectors and let the compiler generate the super-loop. Of course, there are still "issues" with vertex programs. :(
From: F. <jrf...@tu...> - 2003-04-05 01:40:33
On Fri, Apr 04, 2003 at 05:13:44PM -0800, Ian Romanick wrote:
> Realistically, either hardware or software uses either
> array-of-structures or structure-of-arrays. Most hardware uses the
> former. At that point it becomes a matter of, for a given state vector,
> what's the offset in the structure of an element? The assembly code in
> the radeon & r200 drivers handles this very nicely.

You're forgetting the data type. Perhaps recent hardware only uses
unsigned chars for color and floats for the rest, but on older hardware
(such as the Mach64) you have quite a mix of floating-point, integer,
and fixed-point data types...

> > The thing I found more interesting is the issue of applying the TCL
> > operations to all the vertices at once, or to one vertex at a time.
> > From previous discussions on this list it seems that nowadays most
> > CPU performance is dictated by the cache, so it really seems the
> > latter option is more efficient, but Mesa implements the former (the
> > stages are even called "pipeline stages") and changing that would
> > mean a big overhaul of the TnL module.
>
> This would be very, very, very tricky. We'd basically need several
> different super-loops depending on the GL state vector. The super-loops
> would go in the pipeline at the same place where the hardware TCL
> functions go. If the super-loop could do all the processing, the
> following TCL stages would be skipped.

This kind of thing is already done in the vertex buffer construction,
with templates whose sections are #ifdef'd out for each "super-loop"
according to the state it's meant for. If we used C++ for this, we could
use templates to instantiate classes for every possible combination of
vertex format and TCL operations.

> Before going down that road we'd want to sit down with oprofile and a
> bunch of applications to decide which sets of state we wanted to tune
> for. IMHO, we'd be better to spend our time writing a highly optimized
> just-in-time compiler for ARB_vertex_program. Then we could just write
> vertex programs for the different "important" state vectors and let the
> compiler generate the super-loop. Of course, there are still "issues"
> with vertex programs. :(

José Fonseca
|
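[José's suggestion above -- replacing the #ifdef'd template sections with C++ templates instantiated per vertex format -- can be sketched as follows. This is an illustration only: the two formats, their field layouts, and the packing functions are invented for the example and are not any driver's real layout.]

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Two hypothetical hardware vertex formats. Their field types differ
// (cf. the Mach64 mix of float/int/fixed), so each format supplies its
// own conversion from the canonical float attributes.
struct FmtFloatColor {
    struct Out { float x, y, z; float r, g, b; };
    static Out emit(float x, float y, float z, float r, float g, float b) {
        return Out{x, y, z, r, g, b};
    }
};

struct FmtPackedColor {
    struct Out { float x, y, z; std::uint32_t rgb; };
    static Out emit(float x, float y, float z, float r, float g, float b) {
        auto c = [](float f) { return std::uint32_t(f * 255.0f + 0.5f); };
        return Out{x, y, z, (c(r) << 16) | (c(g) << 8) | c(b)};
    }
};

// One "super-loop" template: the compiler instantiates a specialized
// loop per format, in place of #ifdef'd template sections. Input rows
// are (x, y, z, r, g, b) in floats.
template <class Fmt>
std::vector<typename Fmt::Out> build_vertices(const float (*in)[6],
                                              std::size_t n) {
    std::vector<typename Fmt::Out> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(Fmt::emit(in[i][0], in[i][1], in[i][2],
                                in[i][3], in[i][4], in[i][5]));
    return out;
}
```

Each TCL-operation combination would be a further template parameter in the same style; the point is that the combinatorial explosion is handled by the compiler rather than by the preprocessor.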
From: Allen A. <ak...@po...> - 2003-04-05 03:08:45
|
On Fri, Apr 04, 2003 at 05:13:44PM -0800, Ian Romanick wrote:
| .... IMHO, we'd be better to spend our time writing a highly optimized
| just-in-time compiler for ARB_vertex_program. Then we could just write
| vertex programs for the different "important" state vectors and let the
| compiler generate the super-loop. ...

Which brings to mind one of my favorite papers on dynamic compilation:

http://citeseer.nj.nec.com/massalin92synthesi.html

| ... Of course, there are still "issues"
| with vertex programs. :(

The ARB IP working group is trying to resolve those, or at least come up
with a way to resolve them. Nothing cast in concrete yet, unfortunately.

Allen
|
From: Keith W. <ke...@tu...> - 2003-04-05 06:28:53
|
Ian Romanick wrote:
> José Fonseca wrote:
>
>> On Fri, Apr 04, 2003 at 10:08:36AM -0800, Ian Romanick wrote:
>>
>>> In principle, I think the producer/consumer idea is good. Why not
>>> implement known optimizations in it from the start? We already have
>>> *working code* to build formatted vertex data (see the radeon & r200
>>> drivers), why not build the object model from there? Each concrete
>>> producer class would have an associated vertex format. On creation,
>>> it would fill in a table of functions to put data in its vertex
>>> buffer. This could mean pointers to generic C functions, or it could
>>> mean dynamically generating code from assembly stubs.
>>>
>>> The idea is that the functions from this table could be put directly
>>> in the dispatch table. This is, IMHO, critically important.
>>>
>>> The various vertex functions then just need to call the object's
>>> produce method. This all boils down to putting a C++ face on a
>>> technique that has been demonstrated to work.
>>
>> I hope that integrating assembly generation with C++ is feasible, but
>> I see it as an implementation issue, apart from the performance
>> issues, which according to all who have replied aren't as negligible
>> as I thought. The reason is that this kind of optimization is very
>> dependent on the vertex formats and other hardware details, making it
>> difficult to reuse the code -- which is exactly what I want to avoid
>> at this stage.
>
> Realistically, either hardware or software uses either
> array-of-structures or structure-of-arrays. Most hardware uses the
> former. At that point it becomes a matter of, for a given state vector,
> what's the offset in the structure of an element? The assembly code in
> the radeon & r200 drivers handles this very nicely.
>
>>> I do have one question. Do we really want to invoke the producer on
>>> every vertex immediately? In the radeon / r200 drivers this just
>>> copies the whole vertex to a DMA buffer. Why not generate the data
>>> directly where it needs to go? I know that if the vertex format
>>> changes before the vertex is complete we need to copy out of the
>>> temporary buffer into the GL state vector, but that doesn't seem like
>>> the common case. At the very least, some guys at Intel think
>>> generating data directly in DMA buffers is the way to go:
>>>
>>> http://www.intel.com/technology/itj/Q21999/ARTICLES/art_4.htm
>>
>> This is a very interesting read. Thanks for the pointer.
>>
>> It's complicated to know the vertices' positions in the DMA buffer
>> from the beginning, especially because of clipping, since vertices can
>> be added or removed. But if I understood correctly, it's still better
>> to do that in DMA memory and move the vertices around to avoid cache
>> misses. It can be very tricky, though: imagine that clipping generates
>> vertices that no longer fit in the DMA buffer -- what would be done
>> then?
>
> I think the "online driver model" from the paper only works if you have
> a single loop that does all the processing. Since Mesa uses a pipeline,
> it would be very tricky. Using the "online driver model" for a card
> w/HW TCL would be a different story.
>
>> The thing I found more interesting is the issue of applying the TCL
>> operations to all the vertices at once, or to one vertex at a time.
>> From previous discussions on this list it seems that nowadays most CPU
>> performance is dictated by the cache, so it really seems the latter
>> option is more efficient, but Mesa implements the former (the stages
>> are even called "pipeline stages") and changing that would mean a big
>> overhaul of the TnL module.
>
> This would be very, very, very tricky. We'd basically need several
> different super-loops depending on the GL state vector. The super-loops
> would go in the pipeline at the same place where the hardware TCL
> functions go. If the super-loop could do all the processing, the
> following TCL stages would be skipped.

This sounds like the 'fastpath' stages which were common in drivers
based on Mesa 3.x. We had a pipeline stage, which most drivers supplied,
tuned to handle Quake 3 CVA-style rendering operations. It was pretty
fast, but in the end not much faster than standard Mesa 4.x operation.

The fallback hardware TCL processing in the radeon drivers is installed
as a pipeline stage also.

Keith
|
From: Keith W. <ke...@tu...> - 2003-04-05 06:47:46
|
Ian Romanick wrote:
> Before going down that road we'd want to sit down with oprofile and a
> bunch of applications to decide which sets of state we wanted to tune
> for. IMHO, we'd be better to spend our time writing a highly optimized
> just-in-time compiler for ARB_vertex_program. Then we could just write
> vertex programs for the different "important" state vectors and let the
> compiler generate the super-loop. Of course, there are still "issues"
> with vertex programs. :(

Oprofile doesn't do a good job on runtime-generated code yet. I guess
they're getting a bit more stabilized now, so it might be time to bring
up the idea/issue/problem with them... I'm not sure what solution there
could be, especially as oprofile isn't really tied to a single run of a
program.

Keith
|
From: Keith W. <ke...@tu...> - 2003-04-05 06:21:22
|
> The thing I found more interesting is the issue of applying the TCL
> operations to all the vertices at once, or to one vertex at a time.
> From previous discussions on this list it seems that nowadays most CPU
> performance is dictated by the cache, so it really seems the latter
> option is more efficient, but Mesa implements the former (the stages
> are even called "pipeline stages") and changing that would mean a big
> overhaul of the TnL module.

Doing it in arrays is better from an instruction-cache point of view,
and as long as the arrays are small enough to fit in cache, there's no
penalty from a data-cache point of view. That's the point of, e.g., the
code in t_array_api.c which cuts large arrays up into 256-vertex chunks
for processing by the tnl pipeline.

Keith
|
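[Keith's chunking argument -- run every pipeline stage over a cache-sized batch before moving to the next batch -- can be sketched as below. The chunk size matches the 256 mentioned above; the two stage functions are trivial stand-ins, not Mesa's real t_array_api.c stages.]

```cpp
#include <algorithm>
#include <cstddef>

struct Vertex { float x, y, z; };

// Stand-in pipeline stages (assumptions, not Mesa's real ones):
// stage 1 scales x, stage 2 offsets x, each over a whole chunk.
static void transform(Vertex *v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) v[i].x *= 2.0f;
}
static void light(Vertex *v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) v[i].x += 1.0f;
}

// Run all stages over one chunk before advancing, so the chunk's data
// is still hot in the data cache when the second stage touches it --
// while each stage's loop body stays small for the instruction cache.
void run_pipeline(Vertex *verts, std::size_t count) {
    const std::size_t kChunk = 256;
    for (std::size_t i = 0; i < count; i += kChunk) {
        const std::size_t n = std::min(kChunk, count - i);
        transform(verts + i, n);
        light(verts + i, n);
    }
}
```

This is the middle ground between "one vertex at a time" (good data locality, poor instruction locality) and "the whole array per stage" (the reverse, once the array outgrows the cache).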
From: Keith W. <ke...@tu...> - 2003-04-04 21:26:22
|
José Fonseca wrote:
> On Fri, Apr 04, 2003 at 08:48:35AM -0700, Brian Paul wrote:
>
>> In general, this sounds reasonable but you also have to consider
>> performance. The glVertex, Color, TexCoord, etc. commands have to be
>> simple and fast. As it is now, glColor4f (for example) (when
>> implemented in x86 assembly) is just a jump into _tnl_Color4f() which
>> stuffs the color into the immediate struct and returns. Something
>> similar is done in the R200 driver.
>>
>> If the implementation of _tnl_Color4f() involves a call to
>> producer->Color4f() we'd lose some performance.
>
> I know, but my objective is to design a good object interface into
> which all drivers may fit and reuse code. When a driver gets to the
> point where the producer->Color4f() calls are the main performance
> bottleneck (!?) the developer is free to write a tailored version of
> TnLProducer that eliminates that extra call:
>
>     class TnLProducerFast {
>
>         Vertex current;
>         TnLConsumer *consumer;
>
>         TnLProducer(TnLConsumer *_consumer) {
>             consumer = _consumer;
>         }
>
>         void activate() {
>             _glapi_setapi(GL_COLOR3f, _Color3f)
>             ...
>         }
>
>         static _Color3f(r, g, b) {
>             TnLProducer *self = GET_THIS_PTR_FROM_CURRENT_CTX();
>             self->current.r = r; self->current.g = g; self->current.b = b;
>         }
>     };
>
> We can even generate this TnLProducerFast automatically from the
> original TnLProducer with a template, i.e.,
>
>     template < class T >
>     class TnLProducerTmpl {
>
>         T tnl;
>
>         void activate() {
>             _glapi_setapi(GL_COLOR3f, _Color3f)
>             ...
>         }
>
>         static _Color3f(r, g, b) {
>             TnLProducerTmpl *self = GET_THIS_PTR_FROM_CURRENT_CTX();
>             self->tnl.Color3f(r, g, b); // This call is eliminated if
>                                         // T::Color3f is inlined
>         }
>     };
>
>     typedef TnLProducerTmpl< TnLProducer > TnLProducerFast;
>
> But this is all of _very_ _little_ importance compared to the ability
> to _write_ a full driver fast, which is given by a well-designed OOP
> interface. As I have said here several times, this kind of low-level
> optimization consumes too much development time, with the result that
> higher-level optimizations (usually with much more impact on
> performance) are never attempted.

The optimization of the vertex API has yielded huge improvements. Even
with the runtime-code-generated versions of these functions in the
radeon/r200 drivers, they *still* dominate viewperf profile runs --
meaning that *all other optimizations* are a waste of time for viewperf,
because 60% of your time is being spent in the vertex API functions.

>> Nowadays, vertex arrays are the path to use if you really care about
>> performance, of course, but a lot of apps still use the regular
>> per-vertex GL functions.

Except for applications that already exist and use the vertex APIs -- of
which there are many.

And vertex arrays aren't the fastpath any more, but things like
ARB_vertex_array_object or NV_vertex_array_range.

> Now that you mention vertex arrays: for those, the producer would be
> different, but the consumer would be the same.

For developing a driver, it's not necessary to touch the tnl code at
all -- even hardware T&L drivers can quite happily plug into the
existing mechanisms and get OK performance.

Keith
|
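[José's TnLProducerTmpl pseudocode above can be fleshed out into compilable form. The dispatch-table machinery (_glapi_setapi, GET_THIS_PTR_FROM_CURRENT_CTX) is replaced with a simple static "current context" pointer -- a stand-in, not Mesa's real glapi -- so the only point demonstrated is that the inner tnl.Color3f call inlines away, removing the extra indirect call Brian worried about.]

```cpp
struct Vertex { float r, g, b, x, y, z; };

struct TnLConsumer {
    Vertex last{};
    int count = 0;
    // A real consumer would write to a DMA buffer or display list.
    void consume(const Vertex &v) { last = v; ++count; }
};

// The plain producer: ordinary inline member functions.
struct TnLProducer {
    Vertex current{};
    TnLConsumer *consumer;
    explicit TnLProducer(TnLConsumer *c) : consumer(c) {}
    void Color3f(float r, float g, float b) {
        current.r = r; current.g = g; current.b = b;
    }
    void Coord3f(float x, float y, float z) {
        current.x = x; current.y = y; current.z = z;
        consumer->consume(current); // produce() on each complete vertex
    }
};

// Template wrapper generating dispatch-table-ready statics from any
// producer type T. current_ctx stands in for fetching "this" from the
// current GL context.
template <class T>
struct TnLProducerTmpl {
    T tnl;
    static TnLProducerTmpl *current_ctx;
    explicit TnLProducerTmpl(TnLConsumer *c) : tnl(c) {}
    // These statics are what would be installed via _glapi_setapi();
    // the inner member call is inlined away by the compiler.
    static void _Color3f(float r, float g, float b) {
        current_ctx->tnl.Color3f(r, g, b);
    }
    static void _Coord3f(float x, float y, float z) {
        current_ctx->tnl.Coord3f(x, y, z);
    }
};
template <class T>
TnLProducerTmpl<T> *TnLProducerTmpl<T>::current_ctx = nullptr;

using TnLProducerFast = TnLProducerTmpl<TnLProducer>;
```

Usage mirrors the dispatch path: activate a context, then call the statics as the GL entry points would.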
From: F. <jrf...@tu...> - 2003-04-05 00:14:18
|
On Fri, Apr 04, 2003 at 10:23:21PM +0100, Keith Whitwell wrote:
>
> The optimization of the vertex API has yielded huge improvements. Even
> with the runtime-code-generated versions of these functions in the
> radeon/r200 drivers, they *still* dominate viewperf profile runs --
> meaning that *all other optimizations* are a waste of time for
> viewperf, because 60% of your time is being spent in the vertex API
> functions.

I was underestimating its importance then...

> >> Nowadays, vertex arrays are the path to use if you really care about
> >> performance, of course, but a lot of apps still use the regular
> >> per-vertex GL functions.
>
> Except for applications that already exist and use the vertex APIs --
> of which there are many.
>
> And vertex arrays aren't the fastpath any more, but things like
> ARB_vertex_array_object or NV_vertex_array_range.
>
> > Now that you mention vertex arrays: for those, the producer would be
> > different, but the consumer would be the same.
>
> For developing a driver, it's not necessary to touch the tnl code at
> all -- even hardware T&L drivers can quite happily plug into the
> existing mechanisms and get OK performance.

For now I'll also be plugging the C++ classes into the existing T&L
code, but in the future I may want to change the way the software T&L
interfaces with the [hardware] rasterizers, since the current interface
makes it difficult to reuse vertices when outputting tri- or
quad-strips, i.e., you keep sending the same vertices over and over
again, even when they are shared between consecutive triangles.

José Fonseca
|
From: Keith W. <ke...@tu...> - 2003-04-05 06:40:10
|
José Fonseca wrote:
> On Fri, Apr 04, 2003 at 10:23:21PM +0100, Keith Whitwell wrote:
>
>> The optimization of the vertex API has yielded huge improvements. Even
>> with the runtime-code-generated versions of these functions in the
>> radeon/r200 drivers, they *still* dominate viewperf profile runs --
>> meaning that *all other optimizations* are a waste of time for
>> viewperf, because 60% of your time is being spent in the vertex API
>> functions.
>
> I was underestimating its importance then...
>
>>>> Nowadays, vertex arrays are the path to use if you really care about
>>>> performance, of course, but a lot of apps still use the regular
>>>> per-vertex GL functions.
>>
>> Except for applications that already exist and use the vertex APIs --
>> of which there are many.
>>
>> And vertex arrays aren't the fastpath any more, but things like
>> ARB_vertex_array_object or NV_vertex_array_range.
>>
>>> Now that you mention vertex arrays: for those, the producer would be
>>> different, but the consumer would be the same.
>>
>> For developing a driver, it's not necessary to touch the tnl code at
>> all -- even hardware T&L drivers can quite happily plug into the
>> existing mechanisms and get OK performance.
>
> For now I'll also be plugging the C++ classes into the existing T&L
> code, but in the future I may want to change the way the software T&L
> interfaces with the [hardware] rasterizers, since the current interface
> makes it difficult to reuse vertices when outputting tri- or
> quad-strips, i.e., you keep sending the same vertices over and over
> again, even when they are shared between consecutive triangles.

Honestly, that's just not true. You can hook these routines out in a
bunch of different ways.

Look at t_context.h -- the tnl_device_driver struct. In here there are
two tables of function pointers, 'PrimTabVerts' and 'PrimTabElts', which
hand off whole transformed, clipped primitives -- tristrips, quadstrips,
polygons, etc. -- to the driver. Hook these functions out and you can
send the vertices to hardware however you like.

Also look at tnl_dd/t_dd_dmatmp.h, as used in mga/mgarender.c and
elsewhere -- this is an even more direct route to hardware and is very
useful if you have a card that understands tristrips, etc. It probably
isn't much use for a mach64, though.

Clipped triangles are more difficult to handle; we currently call
tnl->Driver.Render.PrimTabElts[GL_POLYGON] for each clipped primitive.
If you can think of a better solution & code it up, I'd be interested to
see it. It would be interesting to see some other approaches, but I
think this one actually ends up not being too bad.

Keith
|
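[The shape of the primitive-table hook Keith describes can be sketched generically. The enum values, struct names, and signatures below are assumptions for illustration -- not Mesa's actual tnl_device_driver declarations -- but they show why hooking whole-primitive functions lets a strip-capable card avoid resending shared vertices, which was José's original complaint.]

```cpp
#include <cstddef>

// Illustrative subset of primitive types, indexed GL-style.
enum Prim { PRIM_TRIANGLES, PRIM_TRI_STRIP, PRIM_QUAD_STRIP,
            PRIM_POLYGON, PRIM_COUNT };

struct Vertex { float x, y, z; };

// A driver-fillable table in the spirit of PrimTabVerts: the tnl module
// hands a whole transformed, clipped primitive to whichever hook the
// driver installed for that primitive type.
using PrimFunc = void (*)(const Vertex *verts, std::size_t count);

struct DeviceDriver {
    PrimFunc prim_tab[PRIM_COUNT];
};

// Count of vertex DWords "sent to hardware", for comparison.
static std::size_t g_emitted = 0;

// Hook for a card that understands tristrips natively: each strip
// vertex is emitted exactly once, shared vertices are reused on-chip.
static void hw_tri_strip(const Vertex *, std::size_t count) {
    g_emitted += count;
}

// Fallback hook that expands the strip into independent triangles,
// resending the two shared vertices of every interior triangle.
static void sw_tri_strip(const Vertex *, std::size_t count) {
    if (count >= 3) g_emitted += (count - 2) * 3;
}
```

For a 10-vertex strip the native hook emits 10 vertices while the expanding fallback emits 24 -- the redundancy the hook mechanism lets a driver eliminate.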
From: F. <jrf...@tu...> - 2003-04-05 09:53:30
|
On Sat, Apr 05, 2003 at 07:37:13AM +0100, Keith Whitwell wrote:
> José Fonseca wrote:
> > For now I'll also be plugging the C++ classes into the existing T&L
> > code, but in the future I may want to change the way the software
> > T&L interfaces with the [hardware] rasterizers, since the current
> > interface makes it difficult to reuse vertices when outputting tri-
> > or quad-strips, i.e., you keep sending the same vertices over and
> > over again, even when they are shared between consecutive triangles.
>
> Honestly, that's just not true. You can hook these routines out in a
> bunch of different ways.
>
> Look at t_context.h -- the tnl_device_driver struct. In here there are
> two tables of function pointers, 'PrimTabVerts' and 'PrimTabElts',
> which hand off whole transformed, clipped primitives -- tristrips,
> quadstrips, polygons, etc. -- to the driver. Hook these functions out
> and you can send the vertices to hardware however you like.

I thought the interface exposed by t_dd_tritmp.h was the only kind of
interface exposed by TnL. I really have to study the TnL module a little
more (perhaps writing some quick documentation too).

> Also look at tnl_dd/t_dd_dmatmp.h, as used in mga/mgarender.c and
> elsewhere -- this is an even more direct route to hardware and is very
> useful if you have a card that understands tristrips, etc. It probably
> isn't much use for a mach64, though.

Quite the opposite: the Mach64's triangle setup engine is designed in
such a way that it allows you to update vertices selectively (i.e., for
each vertex in the DMA buffer you specify which of the 3 vertices of the
triangle setup engine you're updating), allowing it to cope with
triangle strips of any kind. So it may be useful at some point to switch
to t_dd_dmatmp.h instead of t_dd_tritmp.h in the Mach64 driver.

> Clipped triangles are more difficult to handle; we currently call
> tnl->Driver.Render.PrimTabElts[GL_POLYGON] for each clipped primitive.
> If you can think of a better solution & code it up, I'd be interested
> to see it. It would be interesting to see some other approaches, but I
> think this one actually ends up not being too bad.

For the rasterization I was also considering using a two-stage
producer-consumer scheme: A->B->C. Producer A is chosen according to the
vertex format, while C is always the same and only emits the vertex (or
vertices) received to the hardware. Vertices don't necessarily have to
be copied across the stages; a reference can be used whenever possible.
For dealing with clipped vertices, when passing them to B there would be
a "clipped" flag associated with every vertex, and they would all follow
some ordering convention. B could be implemented to just call the
polygon version of B, or be smarter and reuse some of the vertices
already downstream.

José Fonseca
|
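[José's A->B->C scheme can be sketched as three chained stages. The class names, the per-vertex clipped flag, and the stage responsibilities are taken from his description; everything else (the logging consumer, the pass-through strip stage) is invented scaffolding for the example.]

```cpp
#include <vector>

struct Vertex {
    float x, y, z;
    bool clipped; // set by the clipper; consumers follow an agreed ordering
};

// Stage C: always the same -- emits vertices to "hardware" (logged here).
struct EmitStage {
    std::vector<Vertex> emitted;
    void consume(const Vertex &v) { emitted.push_back(v); }
};

// Stage B: primitive assembly. This naive version forwards everything;
// a smarter B could notice vertices already downstream and reuse them,
// or fall back to polygon handling when it sees clipped vertices.
struct StripStage {
    EmitStage *next;
    void consume(const Vertex &v) {
        // (clipped-vertex fallback would go here)
        next->consume(v); // passes a reference -- no copy across stages
    }
};

// Stage A: chosen per vertex format; builds vertices and feeds B.
struct Producer {
    StripStage *next;
    void vertex(float x, float y, float z, bool clipped = false) {
        next->consume(Vertex{x, y, z, clipped});
    }
};
```

Wiring is just pointer assembly: `Producer a{&b}; StripStage b{&c}; EmitStage c;` (declared bottom-up), after which each `a.vertex(...)` call flows through all three stages.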
From: Ian R. <id...@us...> - 2003-04-07 15:37:36
|
José Fonseca wrote:
> On Sat, Apr 05, 2003 at 07:37:13AM +0100, Keith Whitwell wrote:
>> Also look at tnl_dd/t_dd_dmatmp.h, as used in mga/mgarender.c and
>> elsewhere -- this is an even more direct route to hardware and is
>> very useful if you have a card that understands tristrips, etc. It
>> probably isn't much use for a mach64, though.
>
> Quite the opposite: the Mach64's triangle setup engine is designed in
> such a way that it allows you to update vertices selectively (i.e., for
> each vertex in the DMA buffer you specify which of the 3 vertices of
> the triangle setup engine you're updating), allowing it to cope with
> triangle strips of any kind. So it may be useful at some point to
> switch to t_dd_dmatmp.h instead of t_dd_tritmp.h in the Mach64 driver.

Just as a random side note, this sounds like the way some Sun hardware
works.

http://oss.sgi.com/projects/ogl-sample/registry/SUN/triangle_list.txt

Does any of the other hardware supported by open-source drivers work
this way? Just curious...
|
From: Keith W. <ke...@tu...> - 2003-04-07 15:48:11
|
Ian Romanick wrote:
> José Fonseca wrote:
>
>> On Sat, Apr 05, 2003 at 07:37:13AM +0100, Keith Whitwell wrote:
>>
>>> Also look at tnl_dd/t_dd_dmatmp.h, as used in mga/mgarender.c and
>>> elsewhere -- this is an even more direct route to hardware and is
>>> very useful if you have a card that understands tristrips, etc. It
>>> probably isn't much use for a mach64, though.
>>
>> Quite the opposite: the Mach64's triangle setup engine is designed in
>> such a way that it allows you to update vertices selectively (i.e.,
>> for each vertex in the DMA buffer you specify which of the 3 vertices
>> of the triangle setup engine you're updating), allowing it to cope
>> with triangle strips of any kind. So it may be useful at some point to
>> switch to t_dd_dmatmp.h instead of t_dd_tritmp.h in the Mach64 driver.
>
> Just as a random side note, this sounds like the way some Sun hardware
> works.
>
> http://oss.sgi.com/projects/ogl-sample/registry/SUN/triangle_list.txt
>
> Does any of the other hardware supported by open-source drivers work
> this way? Just curious...

Most of them just export the Direct3D primitive types, plus perhaps the
additional GL ones. I think the Mach64 might support this, and maybe the
r128 too.

Keith
|
From: F. <jrf...@tu...> - 2003-04-08 13:19:03
|
On Mon, Apr 07, 2003 at 08:37:10AM -0700, Ian Romanick wrote:
> José Fonseca wrote:
> > On Sat, Apr 05, 2003 at 07:37:13AM +0100, Keith Whitwell wrote:
> >> Also look at tnl_dd/t_dd_dmatmp.h, as used in mga/mgarender.c and
> >> elsewhere -- this is an even more direct route to hardware and is
> >> very useful if you have a card that understands tristrips, etc. It
> >> probably isn't much use for a mach64, though.
> >
> > Quite the opposite: the Mach64's triangle setup engine is designed in
> > such a way that it allows you to update vertices selectively (i.e.,
> > for each vertex in the DMA buffer you specify which of the 3 vertices
> > of the triangle setup engine you're updating), allowing it to cope
> > with triangle strips of any kind. So it may be useful at some point
> > to switch to t_dd_dmatmp.h instead of t_dd_tritmp.h in the Mach64
> > driver.
>
> Just as a random side note, this sounds like the way some Sun hardware
> works.
>
> http://oss.sgi.com/projects/ogl-sample/registry/SUN/triangle_list.txt

Thanks for the pointer. I think I'm going to use some of the ideas here
for the C++ rasterizer interface, to model the possible triangle setup
engines in a hardware-independent fashion.

As a curiosity, the reason the Mach64 accepts this triangle strip model
is that the DMA buffers are basically register-value pairs, so one can
choose the triangle vertex number to update by choosing the respective
register, e.g. (one DWORD per line):

    MACH64_VERTEX_1_X_Y 1   // 1 + 1 = 2 value DWORDS follow
    (x1 << 16) | y1
    z1
    MACH64_VERTEX_2_X_Y 1
    (x2 << 16) | y2
    z2
    MACH64_VERTEX_3_X_Y 1
    (x3 << 16) | y3
    z3
    MACH64_VERTEX_2_X_Y 1
    (x2 << 16) | y2
    z2
    ...

José Fonseca
|
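[The register-value encoding José describes can be sketched as a small DMA-buffer builder. The register constants below are placeholders, not the real Mach64 register offsets; the point is only the encoding shape: "register, count" followed by the value DWORDs, where rewriting a single vertex-slot register updates one vertex of the setup engine while the other two are reused.]

```cpp
#include <cstdint>
#include <vector>

// Hypothetical indices for the three setup-engine vertex slots
// (the real Mach64 register values differ).
enum : std::uint32_t {
    MACH64_VERTEX_1_X_Y = 0x100,
    MACH64_VERTEX_2_X_Y = 0x104,
    MACH64_VERTEX_3_X_Y = 0x108,
};

// Append one vertex update: register, count, then count + 1 value
// DWORDs -- x/y packed into one DWORD plus z, so count is 1 here.
static void emit_vertex(std::vector<std::uint32_t> &dma, std::uint32_t reg,
                        std::uint16_t x, std::uint16_t y, std::uint32_t z) {
    dma.push_back(reg);
    dma.push_back(1); // 1 + 1 = 2 value DWORDS follow
    dma.push_back((std::uint32_t(x) << 16) | y);
    dma.push_back(z);
}
```

For a strip, the first triangle fills all three slots; each subsequent triangle rewrites only the one slot whose vertex changed, so shared strip vertices never cross the bus twice.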