From: James S. <arr...@gm...> - 2006-09-04 22:05:32
I have a dual-head ATI system (one GPU, two Render SPUs) which performs very poorly. After much profiling I have determined that it is spending almost all of its time (near 99.4%) in fglrx.so. The functions in this library are mostly invoked by the code that processes unrolled glDrawElements commands (the glArrayElement calls are causing this CPU usage). I wrote a simple test application; my results show poor performance from the ATI GPU when two concurrent processes use VBOs and glArrayElement to draw objects.

As I only need to render to one of the ATI heads at a time, I think a possible solution is to filter out unneeded glDrawElements commands. This could be done by checking the rendering window rectangle against the rectangle of each monitor. If the rectangles intersect, we would set the pack buffer to thread->buffer[current_server] and then do what we normally do to translate and pack the command for that server, repeating for each server. When done, we would set the pack buffer back to thread->geometry_buffer (that is what it was before, right?). This would prevent the glDrawElements command from affecting servers that do not have the GL rendering window on them.

Is dropping glDrawElements commands for render SPUs whose monitors don't intersect the OpenGL output window acceptable practice? Will it cause problems for downstream SPUs? What is the best method to integrate such optimizations into Chromium?

I am only aware of two types of pack buffers in the tilesort SPU: the geometry_buffer and the server-specific buffers. Are there any others I should know about?

Thank you for your time,
James Steven Supancic III
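A minimal sketch of the window-versus-monitor intersection test described above; the Rect type and its fields are placeholders, not Chromium's actual window or screen-extent structures:

    /* Hypothetical rectangle type; a real patch would use the tilesort
     * SPU's own window and extent structures. */
    typedef struct {
        int x, y, w, h;
    } Rect;

    /* Return non-zero if the two rectangles overlap. */
    static int rects_intersect(const Rect *a, const Rect *b)
    {
        return a->x < b->x + b->w && b->x < a->x + a->w &&
               a->y < b->y + b->h && b->y < a->y + a->h;
    }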
From: Brian P. <bri...@tu...> - 2006-09-05 18:14:29
James Supancic wrote:
> I have a dual-head ATI system (one GPU, two Render SPUs) which
> performs very poorly. After much profiling I have determined that it
> is spending almost all of its time (near 99.4%) in fglrx.so. The
> functions in this library are mostly invoked by the code that
> processes unrolled glDrawElements commands (the glArrayElement calls
> are causing this CPU usage). I wrote a simple test application; my
> results show poor performance from the ATI GPU when two concurrent
> processes use VBOs and glArrayElement to draw objects.
>
> As I only need to render to one of the ATI heads at a time, I think a
> possible solution is to filter out unneeded glDrawElements commands.
>
> This could be done by checking the rendering window rectangle against
> the rectangle of each monitor. If the rectangles intersect, we would
> set the pack buffer to thread->buffer[current_server] and then do
> what we normally do to translate and pack the command for that
> server, repeating for each server. When done, we would set the pack
> buffer back to thread->geometry_buffer (that is what it was before,
> right?). This would prevent the glDrawElements command from affecting
> servers that do not have the GL rendering window on them.

The tilesort SPU broadcasts VBO drawing commands to all crservers. The tilesort SPU's state tracker keeps a client-side copy of the VBO data but does not analyze VBO drawing commands to compute the bounding box (which would be used for bucketing). The cost of computing the bounding boxes in these cases could be more than just broadcasting the command.

If you want to optimize things, you'll have to add new glDrawArrays/glDrawElements code to the tilesort SPU that computes bounding boxes. Unfortunately, you can't just look at a VBO to determine bounds since there's no way to interpret the VBO's data; you need the vertex array parameters, etc., which can vary from one draw to the next.

> Is dropping glDrawElements commands for render SPUs whose monitors
> don't intersect the OpenGL output window acceptable practice? Will it
> cause problems for downstream SPUs?
>
> What is the best method to integrate such optimizations into Chromium?
>
> I am only aware of two types of pack buffers in the tilesort SPU: the
> geometry_buffer and the server-specific buffers. Are there any others
> I should know about?

Is your application putting its array indices into a GL_ELEMENT_ARRAY_BUFFER VBO? To get the best performance, you want both your vertex data and your indices to be in VBOs.

-Brian
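For reference, a minimal sketch of what Brian describes, with both the vertex data and the indices stored in buffer objects (standard GL 1.5 usage; on 2006-era headers the same calls may only be available as the ARB-suffixed entry points via glext.h, and the geometry here is just placeholder data):

    #include <GL/gl.h>

    static const GLfloat verts[] = {
        0.0f, 0.0f, 0.0f,
        1.0f, 0.0f, 0.0f,
        0.0f, 1.0f, 0.0f,
    };
    static const GLuint indices[] = { 0, 1, 2 };

    static void draw_with_vbos(void)
    {
        GLuint vbo, ibo;

        /* Vertex positions in a GL_ARRAY_BUFFER object. */
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);
        glVertexPointer(3, GL_FLOAT, 0, (void *) 0);
        glEnableClientState(GL_VERTEX_ARRAY);

        /* Indices in a GL_ELEMENT_ARRAY_BUFFER object, so glDrawElements
         * reads them from the buffer rather than from client memory. */
        glGenBuffers(1, &ibo);
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
        glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices,
                     GL_STATIC_DRAW);

        /* With an index VBO bound, the last argument is an offset into
         * that buffer, not a client-memory pointer. */
        glDrawElements(GL_TRIANGLES, 3, GL_UNSIGNED_INT, (void *) 0);
    }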
From: James S. <arr...@gm...> - 2006-09-08 05:07:44
> The tilesort SPU broadcasts VBO drawing commands to all crservers.
> The tilesort SPU's state tracker keeps a client-side copy of the VBO
> data but does not analyze VBO drawing commands to compute the
> bounding box (which would be used for bucketing).
>
> The cost of computing the bounding boxes in these cases could be more
> than just broadcasting the command.
>
> If you want to optimize things, you'll have to add new
> glDrawArrays/glDrawElements code to the tilesort SPU that computes
> bounding boxes.
>
> Unfortunately, you can't just look at a VBO to determine bounds since
> there's no way to interpret the VBO's data; you need the vertex array
> parameters, etc., which can vary from one draw to the next.

I am not talking about anything that advanced. Most of the time the rendering window will not be on both of the ATI monitors. A simple way to filter out unneeded glDraw* commands would be to check:

    (thread->currentContext->currentWindow->server + index)->num_extents

If a server does not have any extents, then it isn't really doing much, so we should be able to drop a lot of commands for it.

I tried checking this in a for loop, calling

    crPackSetBuffer(thread->packer, &(thread->buffer[index]));

whenever num_extents was non-zero, and setting the pack buffer back to what it was before after the loop ended. I think I am now putting the unrolled glDrawElements call into the server-specific buffers. Performance has gone up a lot, but there are some strange visual errors; it looks as if some coordinate data is being truncated.

Obviously, putting the data into the server-specific buffer doesn't have the same effect as putting it into the global buffer. What is the correct way to send a command to a single server?

I think the thread->buffer buffers are for the exclusive use of the state tracker? Should I try to find a way to use the state tracker to send the data to the servers as needed? Or maybe add new buffers and add code to the flush mechanism for this purpose?

Thank you for your time,
James Steven Supancic III
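A rough sketch of the loop James describes, using only the fields and the crPackSetBuffer() call he names; num_servers is a placeholder for however the surrounding tilesort code counts its crservers, and crPackGetBuffer() is assumed to be the counterpart call that saves the current pack-buffer state before switching (this is an illustration, not a tested patch):

    int i;

    /* Save the state of the current (geometry) buffer before switching
     * away from it. */
    crPackGetBuffer(thread->packer, &(thread->geometry_buffer));

    for (i = 0; i < num_servers; i++) {
        /* Skip crservers that display no part of the window. */
        if ((thread->currentContext->currentWindow->server + i)->num_extents == 0)
            continue;

        /* Pack the unrolled draw command into this server's own buffer. */
        crPackSetBuffer(thread->packer, &(thread->buffer[i]));
        /* ... translate and pack the glDrawElements data here ... */
        crPackGetBuffer(thread->packer, &(thread->buffer[i]));
    }

    /* Restore the geometry buffer so later commands pack as before. */
    crPackSetBuffer(thread->packer, &(thread->geometry_buffer));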
From: Brian P. <bri...@tu...> - 2006-09-08 20:08:03
James Supancic wrote:
>> The tilesort SPU broadcasts VBO drawing commands to all crservers.
>> The tilesort SPU's state tracker keeps a client-side copy of the VBO
>> data but does not analyze VBO drawing commands to compute the
>> bounding box (which would be used for bucketing).
>>
>> The cost of computing the bounding boxes in these cases could be
>> more than just broadcasting the command.
>>
>> If you want to optimize things, you'll have to add new
>> glDrawArrays/glDrawElements code to the tilesort SPU that computes
>> bounding boxes.
>>
>> Unfortunately, you can't just look at a VBO to determine bounds
>> since there's no way to interpret the VBO's data; you need the
>> vertex array parameters, etc., which can vary from one draw to the
>> next.
>
> I am not talking about anything that advanced. Most of the time the
> rendering window will not be on both of the ATI monitors. A simple
> way to filter out unneeded glDraw* commands would be to check:
>
>     (thread->currentContext->currentWindow->server + index)->num_extents
>
> If a server does not have any extents, then it isn't really doing
> much, so we should be able to drop a lot of commands for it.
>
> I tried checking this in a for loop, calling
>
>     crPackSetBuffer(thread->packer, &(thread->buffer[index]));
>
> whenever num_extents was non-zero, and setting the pack buffer back
> to what it was before after the loop ended. I think I am now putting
> the unrolled glDrawElements call into the server-specific buffers.
> Performance has gone up a lot, but there are some strange visual
> errors; it looks as if some coordinate data is being truncated.
>
> Obviously, putting the data into the server-specific buffer doesn't
> have the same effect as putting it into the global buffer. What is
> the correct way to send a command to a single server?
>
> I think the thread->buffer buffers are for the exclusive use of the
> state tracker? Should I try to find a way to use the state tracker to
> send the data to the servers as needed? Or maybe add new buffers and
> add code to the flush mechanism for this purpose?

I think the best place to plug in this feature is right after the bucketing stage. The bucketing stage looks at bounding boxes to determine which geometry buffers go to each crserver. The tilesortspuBucketGeometry() function produces a bitmask indicating which crservers need the geometry. You'll need to add something like this:

    for (i = 0; i < num servers; i++)
        if (server[i].extents are null)
            bucketInfo->hits[i / 32] &= ~(1 << (i % 32));

-Brian
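Read concretely (num_servers, winInfo, and the per-server num_extents field below are guesses based on the names used earlier in the thread, not verified tilesort identifiers), Brian's pseudocode clears a server's bit in the hits mask right after bucketing, and the later send step packs geometry only for servers whose bit is still set:

    /* After tilesortspuBucketGeometry() has filled in bucketInfo->hits,
     * clear the bit for every crserver that has no screen extents so no
     * geometry is packed or sent for it. */
    for (i = 0; i < num_servers; i++) {
        if (winInfo->server[i].num_extents == 0)
            bucketInfo->hits[i / 32] &= ~(1 << (i % 32));
    }

    /* The sending code then tests the mask per server, roughly: */
    for (i = 0; i < num_servers; i++) {
        if (bucketInfo->hits[i / 32] & (1 << (i % 32))) {
            /* ... pack and send the geometry to crserver i as usual ... */
        }
    }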