From: Keith W. <kei...@ya...> - 2006-09-12 16:46:34

Brian Paul wrote:
> Keith Whitwell wrote:
>> Allen Akin wrote:
>>
>>> On Thu, Sep 07, 2006 at 04:32:43PM -0600, Brian Paul wrote:
>>> | ...
>>> | Looks like a rare case where PBOs are faster! ...
>>>
>>> Yep. Although on an AGP system, I thought PBOs might be more of a win
>>> for DrawPixels and TexImage than for ReadPixels.
>>> | ... though, when the pixel data is actually touched, performance drops
>>> | with PBOs.
>>>
>>> There's always a catch. :-)
>>
>> I had a quick look at the test and I wonder if you are getting the
>> maximum benefit from PBOs.
>>
>> It seems like you're making an attempt to get asynchronous transfers
>> going, but I don't think that full pipelining is possible with the usage
>> in the test, at least when the SumOut flag is true.
>>
>> In particular, the chain of events:
>>
>> 1) Render
>> 2) Issue ReadPixels for top half to pbo0
>> 2a) Issue ReadPixels for bottom half
>> 3) Map pbo0
>>
>> On theoretical/perfect hardware, I believe you are going to hit an
>> unnecessary stall at (3) because the ReadPixels calls should return
>> instantaneously - they just make a request and will not block. Hence
>> the request to Map pbo0 will almost certainly occur before the transfer
>> has completed, and maybe even before rendering has completed.
>>
>> To get full asynchronous behaviour, something like the following might
>> be better. Note that I am not preserving the top/bottom split:
>>
>> i = 0;
>> Render Frame
>> Bind PBO(i)
>> Issue ReadPixels
>>
>> while (1) {
>>    Render Frame
>>    Bind PBO(i ^ 1)
>>    Issue ReadPixels
>>
>>    Bind PBO(i)
>>    MapBuffer
>>    process data
>>
>>    i = i ^ 1;
>> }
>>
>>
>> The hope is that rendering and transfer overlap with the CPU processing
>> of data and that we don't have any path that attempts to pull data back
>> to the CPU and map it immediately without some processing step in between.
>
> I was thinking of that too, but I was intentionally following the
> suggested example for asynchronous readback seen at the end of the
> GL_ARB_pixel_buffer_object spec. I figured if anyone was implementing
> true asynchronous readback they'd support the model seen there.
Drivers will be able to optimize/overlap the second transfer, but the
first one is going to force a stall as I mentioned.
The example in the spec is a pretty pared down demonstration that will
get some benefit by overlapping the second transfer while the CPU is
processing the results from the first. It won't get the most out of the
technique, but it is an easy chunk of code to drop into an existing
application if it happens to fit the usage model.
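
To make the stall concrete, spelled out in GL calls the spec-style
pattern is roughly the following (untested sketch; WIDTH, HEIGHT and
process() are just placeholders, and the two PBOs are assumed to have
been created with glBufferDataARB(..., GL_STREAM_READ_ARB) beforehand):

   void *top, *bottom;

   glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[0]);
   glReadPixels(0, 0, WIDTH, HEIGHT / 2,
                GL_BGRA, GL_UNSIGNED_BYTE, (void *) 0);   /* just queued */

   glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[1]);
   glReadPixels(0, HEIGHT / 2, WIDTH, HEIGHT / 2,
                GL_BGRA, GL_UNSIGNED_BYTE, (void *) 0);   /* also just queued */

   glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[0]);
   /* Nothing useful has happened since the first ReadPixels was queued,
    * so this Map has to wait for rendering plus the first transfer.
    * Only the second half's transfer overlaps with the processing below. */
   top = glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
   process(top);                                          /* placeholder */
   glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);

   glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[1]);
   bottom = glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
   process(bottom);
   glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);
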
For really big transfers, assuming you can't go to the fully asynchronous
model, there might be a benefit in splitting the transfer into 3 or more
pieces.
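
For instance, with four strips instead of two, the idea would be to
queue all of the transfers before mapping anything. Something along
these lines (untested; N, WIDTH, HEIGHT and process() are just
placeholders, 4 bytes per pixel assumed):

   enum { N = 4 };
   GLuint pbo[N];
   const int strip_h = HEIGHT / N;   /* assume HEIGHT divides evenly */
   void *data;
   int i;

   glGenBuffersARB(N, pbo);
   for (i = 0; i < N; i++) {
      glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[i]);
      glBufferDataARB(GL_PIXEL_PACK_BUFFER_ARB, WIDTH * strip_h * 4,
                      NULL, GL_STREAM_READ_ARB);
   }

   /* queue all the strip readbacks up front... */
   for (i = 0; i < N; i++) {
      glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[i]);
      glReadPixels(0, i * strip_h, WIDTH, strip_h,
                   GL_BGRA, GL_UNSIGNED_BYTE, (void *) 0);
   }

   /* ...then map them in the same order: only the first map should have
    * to wait, and later strips keep transferring while the earlier ones
    * are being processed. */
   for (i = 0; i < N; i++) {
      glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[i]);
      data = glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
      process(data);                 /* placeholder: consume strip i */
      glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);
   }
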
> I could code up your approach as well, though. However, I don't think
> it would work with the Chromium VNC SPU (which is where I'm probably
> going to use this).
It's certainly worthwhile to try and avoid stalls where possible. If
it's not possible, the approach from the spec has merit.
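
For completeness, a fleshed-out version of the double-buffered loop I
sketched above might look like this (untested; render_frame() and
process() stand in for whatever the application really does):

   GLuint pbo[2];
   void *data;
   int i = 0, j;

   glGenBuffersARB(2, pbo);
   for (j = 0; j < 2; j++) {
      glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[j]);
      glBufferDataARB(GL_PIXEL_PACK_BUFFER_ARB, WIDTH * HEIGHT * 4,
                      NULL, GL_STREAM_READ_ARB);
   }

   /* prime the pipe: one frame's readback is already in flight before
    * anything gets mapped */
   render_frame();
   glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[i]);
   glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, (void *) 0);

   while (1) {
      /* kick off the next frame's readback into the other PBO... */
      render_frame();
      glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[i ^ 1]);
      glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE,
                   (void *) 0);

      /* ...and only then touch the previous frame's PBO, which has had
       * a whole frame's worth of time to finish transferring */
      glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo[i]);
      data = glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
      process(data);
      glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);

      i ^= 1;
   }

The obvious cost is that the data you process is always one frame old,
which may or may not matter for the application.
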
>
>> Also note, when SumOut is false, a clever driver could notice that
>> nothing has changed and the ReadPixels need only be performed on the
>> first iteration through the loop. In fact, a clever driver could note
>> that the data is never Mapped or otherwise requested by the CPU and
>> eliminate all copying...
>
> I don't think that could ever be safely implemented by a driver. An
> example would be one thread doing glReadPixels and another thread
> touching the data we read back.
The driver would only have to delay issuing the transfer request until
the application tries to map the data. If a second ReadPixels is
received before there is a call to MapBuffer, the transfer could be
aborted/never issued. The data isn't available for any thread to look
at until MapBuffer or GetBufferSubData is called.
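
Purely to illustrate the idea (this is not what any real driver does
today, and all the names below are made up), the driver-side logic
could be as simple as:

   struct read_request {
      int valid;
      /* x, y, width, height, format, etc. */
   };

   struct pbo {
      void *sysmem;                  /* CPU-visible backing store */
      struct read_request pending;   /* recorded but not yet issued */
   };

   /* stands in for the real blit/DMA plus the wait for it to finish;
    * left empty in this sketch */
   static void
   do_transfer_and_wait(struct pbo *buf)
   {
      (void) buf;
   }

   static void
   drv_ReadPixels(struct pbo *dst, struct read_request req)
   {
      /* An earlier request that was never mapped simply gets overwritten
       * here, i.e. aborted without any data having been transferred. */
      dst->pending = req;
   }

   static void *
   drv_MapBuffer(struct pbo *buf)
   {
      if (buf->pending.valid) {
         /* MapBuffer (or GetBufferSubData) is the first point where the
          * app can see the data, so only now do the transfer and wait */
         do_transfer_and_wait(buf);
         buf->pending.valid = 0;
      }
      return buf->sysmem;
   }
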
Keith