RE: [Algorithms] portal engines in outdoor environments
From: Tom F. <to...@mu...> - 2000-08-20 10:59:05
> From: jason watkins [mailto:jas...@po...]
>
> > But _something_ still has to do the readback. The start of the chips'
> > pipeline still needs to know the results at the framebuffer end before
> > it can decide whether or not to start drawing the triangles. It
> > doesn't matter whether the CPU does it or some bit in the T&L section
> > of the chip - it's the same issue - something has to wait for pixels
> > to go most of the way down the rendering pipeline. And that's a
> > performance loss.
>
> Right, of course the rejection logic has to be able to read the
> relevant z's. But it doesn't have to read the most immediate state of
> the z - the state a few polygons ago will do fine for a conservative
> rejection. So I don't see where a stall happens, or where there's a
> performance loss.

That's not the performance hit. You are submitting tris like this:

BeginChunk
Draw tester tris
EndChunk
if ( previous chunk rendered )
{
    Draw real tris
}

The problem is that the if() is being done at the very start of the
pipeline (i.e. the AGP bus - any later and you lose most of the gain),
but it needs to know the results of the rasterisation & Z-test of all
the pixels in the tester tris. That rasterisation and Z-test is
comparatively late in the pipeline on a T&L device (it's early in the
rasterisation, but there is a looooong pipe between the AGP bus and
pixel rasterisation). So your pipeline is going to be completely empty
between those two points. That is a huge pipeline bubble, and if you're
doing it more than once or twice a frame, you are going to lose large
amounts of performance.

You may be able to improve things by doing:

for i = 0 to nObjects
{
    BeginChunk(i)
    Draw tester tris
    EndChunk
}
for i = 0 to nObjects
{
    if ( chunk(i) rendered )
    {
        Draw object i
    }
}

But that is a lot of extra hardware to store all that chunk
information, retrieve it and so on. Lots of complexity.

There are three very nice things about the frame-to-frame coherency
scheme:

(1) No extra fillrate hit. If the object is invisible, it's the same
fillrate as your scheme. If the object is visible, then you still only
draw it once, not twice as with your scheme.

(2) No extra triangles needed. OK, the bounding box is a pretty small
number of tris, but what if you wanted to do this scheme with lots of
smallish objects? That might get significant.

(3) (and this is the biggie) It is already supported by tons of
existing, normal, shipped hardware that is out there. Not some mystical
future device. Real, existing ones that you have probably used.

> Maybe the gain is reduced, when you think about how the rejection
> means there are skips in the flow from AGP RAM to the card's local
> storage/instruction bus, but as I understand it, that's all controlled
> by DMAs from the card anyhow, so not a big deal.

Huge deal if the delay is longer than a few tens of clock cycles. The
AGP FIFOs are not very big, and bubbles of the sort of size you are
talking about are not going to be absorbed by them. So for part of your
frame, the AGP bus will be sitting idle. And if, as is happening, you
are limited by AGP speed, that is going to hurt quite a lot.

> It doesn't rely on frame2frame coherence (which I feel is often a bad
> thing). Perhaps it would be best with a hierarchical z system.

Doesn't help - you still need to rasterise your pixels, which is a long
way down the pipe.

What's wrong with frame-to-frame coherence? Remember, if there is a
camera change or cut, the application can simply discard all the
visibility info it has and just draw everything, until it has vis
information for the new camera position.
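In rough code, the coherency scheme looks something like this. This is
a sketch only, written against the generic OpenGL occlusion-query
interface (a later, portable form of the per-object occlusion tests
some cards expose) rather than any particular card's mechanism, and
DrawObject() / DrawBoundingBox() are hypothetical stand-ins for
whatever the app actually does:

#include <GL/gl.h>

#define MAX_OBJECTS 1024

extern void DrawObject( int i );       /* hypothetical app functions */
extern void DrawBoundingBox( int i );

static GLuint query[MAX_OBJECTS];
static GLuint visible[MAX_OBJECTS];    /* last frame's results */
static int haveResults = 0;            /* no queries issued yet */

void InitQueries( int nObjects )
{
    glGenQueries( nObjects, query );
    for ( int i = 0; i < nObjects; i++ )
        visible[i] = 1;                /* no vis info yet - draw everything */
}

void RenderFrame( int nObjects )
{
    /* Read LAST frame's results first. Those queries have had a whole
       frame to drain through the pipe, so this read does not create the
       bubble an in-frame readback would. On a camera cut, skip the read
       and set every visible[i] back to 1 instead. */
    if ( haveResults )
        for ( int i = 0; i < nObjects; i++ )
            glGetQueryObjectuiv( query[i], GL_QUERY_RESULT, &visible[i] );

    for ( int i = 0; i < nObjects; i++ )
    {
        glBeginQuery( GL_SAMPLES_PASSED, query[i] );
        if ( visible[i] )
        {
            DrawObject( i );           /* the real tris ARE the tester tris */
        }
        else
        {
            /* Hidden last frame: Z-test its bounding box only, with
               colour and depth writes off, so it costs almost nothing. */
            glColorMask( GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE );
            glDepthMask( GL_FALSE );
            DrawBoundingBox( i );
            glDepthMask( GL_TRUE );
            glColorMask( GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE );
        }
        glEndQuery( GL_SAMPLES_PASSED );
    }
    haveResults = 1;
}

The point is that the readback happens a whole frame after submission,
so nothing at the front of the pipe ever sits waiting for pixels at the
back of it.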
> > Added to which, one of the growing bottlenecks is the AGP bus, and
> > doing it this way doesn't help that at all.
>
> Not directly, well... not unless your primitives are so large that the
> rejection happens before the primitive has entirely streamed into the
> card's local FIFO or whatever. However, done right, it could
> potentially reduce state changes or texture downloads, alleviating
> some bus bandwidth.

No no no. There is no way you could get the _hardware_ to reject
state-change info and texture downloads because of some internally
fed-back state. Drivers rely very heavily on persistent state, i.e. not
having to send state that doesn't change. If the driver now doesn't
know whether the state-change info it sent actually made it to the
chip's registers or not, that's just madness - the driver will go potty
trying to figure out what it does and doesn't need to send over the
bus.

Ditto for downloading textures. Since the driver can't know whether the
hardware is going to reject the tris or not, it always has to do the
texture downloads anyway. And if the hardware is doing the downloads
instead (e.g. AGP or cached-AGP textures), then either solution works -
the fast-Z-rejection of pixels means those textures never get fetched.

Far better is the delayed system, where the app can choose not only to
use lower-tri models, but also to force lower mipmap levels, to reduce
texture downloads. Or not, if that looks bad. This sort of decision
MUST be left up to the app of course, because there will always be
cases where a shortcut looks bad, and only the app can know whether
they are acceptable in certain cases or not.

[snip]

> Especially considering that we're quickly reaching the limits of what
> fill rate can be supported by available memory technologies (at the
> right price, that is). And though embedded DRAM seems to answer that,
> it and other possible technologies haven't exactly materialized.

Right. But I've already pointed out that the delayed version uses
_less_ fillrate than what you are proposing, not more, because the
"tester" tris are the same as the tris actually drawn.

Tom Forsyth - Muckyfoot bloke.
Whizzing and pasting and pooting through the day.