RE: [Algorithms] portal engines in outdoor environments
From: Tom F. <to...@mu...> - 2000-08-20 22:20:52
> From: jason watkins [mailto:jas...@po...]
>
> > That's not the performance hit. You are submitting tris like this:
> >
> >     BeginChunk
> >     Draw tester tris
> >     EndChunk
> >     if ( previous chunk rendered )
> >     {
> >         Draw real tris
> >     }
>
> Nope, think more like:
>
>     begin(indexed_array);
>     setboundshint(my_boundingvolume);
>     draw(my_iarray);
>     end(indexed_array);
>
> It's handed off as a single, separable transaction.. the hint merely
> allows the hardware to quickly reject the entire array *if* it's obvious
> it's hidden.. I would think this happens fairly often, like when a
> character model is behind a wall, for example.

Looks the same to me - the point being that the "hint" is just before the
triangles that are gated by it. The hardware can't rearrange triangle order
or interleave or anything mad like that - that's just not how hardware works
(except for wacky stuff like scene-capture architectures, which have some
very different problems to cope with, and so far have failed to live up to
their claims).

> So what you're not getting is that the *if* is _not_ a blocking *if*.

If it doesn't block, then it's not much good. Not blocking = does nothing
(or very little).

> It's just a hint.. the hardware can deal with the hint in many ways..

If you want to abstract things this way, then this is very definitely an API
abstraction. This is not something that can be done directly in hardware.

> It's true that it would work best in a hierarchical Z pipeline, but it
> should still work in the typical one. How that Z information gets relayed
> back to the rejection block is an open question..

Erm... faster than light. OK, here is the pipeline of a typical
T&L-and-rasterise chip:

- Index AGP read
- Vertex caching
- Vertex AGP read (1)
- Transform vertices
- Light vertices
- Clip vertices
- Project vertices
- Construct triangle from vertices
- Backface cull
- Rasteriser setup
- Rasterise
- Read Z buffer
- Test against Z buffer (2)
- Etc. (rest of pixel pipeline).

So what you are asking is for results at (2) to affect what is done at (1)
on the very next triangle. The only way this can be done is to hold the
"drawn" triangles at (1) until all the "test" tris have passed (2). So the
pipeline from (1) to (2) is empty. It has no triangle info in it at all.
That is a huge bubble - probably hundreds of clock cycles long. You noticed
all that complex floating-point maths in the middle, didn't you? Each
floating-point operation has many pipelined clock stages, and there are a
lot of operations in that section of the chip. It's a massive bubble, and no
AGP FIFO is going to deal with those sorts of delays.

> But I can think of several ways in a typical architecture.. it's caching
> scanlines anyhow,

Not in a typical architecture it's not. But let's say it was...

> so it could do something like relay the maximal value for every 4 Zs in
> the scanline being unloaded from cache back to the rejection block, where
> the rejection block has its own low-res local cache.

OK, well there is only the Radeon that does this at the moment. It's cool,
but it's nowhere near commonplace. And it still requires that the test
triangles be rasterised - converted into pixel representation. The details
of Z-testing are not important. The fact that you first have to rasterise
them is the killer.

> The details of how this works could take many different forms.. the point
> being that you only need delayed Z info, and that having the hint
> processed on chip means that you can do it inside a single frame instead
> of relying on a previous frame.

I just don't see the problem with relying on the previous frame. There are
hundreds of algorithms that we use every day in code that rely on
frame-to-frame coherency for speed. One more is not going to drive people
bonkers.
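To spell out what I mean, here is a rough C++-ish sketch of the
previous-frame pattern quoted at the top. Everything in it is a made-up
stand-in, not a real interface - QueryHandle, BeginChunk/EndChunk taking a
query, GetRenderedPixelCount and the Draw*() helpers are all assumptions of
the sketch:

    // --- hypothetical interface, placeholders for whatever the API gives you ---
    struct Object;
    struct QueryHandle { unsigned id; };
    void BeginChunk( QueryHandle& q );            // start counting rendered pixels
    void EndChunk( QueryHandle& q );              // stop counting
    int  GetRenderedPixelCount( QueryHandle& q ); // how many pixels passed Z
    void DrawTesterTris( Object& obj );           // bounding volume or low-rez stand-in
    void DrawRealTris( Object& obj );             // the full-detail object

    // Per-object state carried across frames. Assume queries start off
    // reporting "rendered" so nothing is missing on the very first frame.
    struct ObjectVis
    {
        QueryHandle query[2];   // double-buffered so we never wait on the
                                // query issued this frame
        bool        renderedLastTime;
    };

    void DrawObject( Object& obj, ObjectVis& vis, int frame )
    {
        int cur  = frame & 1;
        int prev = cur ^ 1;

        // Last frame's test has had a whole frame to drain through the
        // pipe, so reading it back should not stall anything.
        vis.renderedLastTime = ( GetRenderedPixelCount( vis.query[prev] ) > 0 );

        // "BeginChunk / Draw tester tris / EndChunk"
        BeginChunk( vis.query[cur] );
        DrawTesterTris( obj );
        EndChunk( vis.query[cur] );

        // "if ( previous chunk rendered ) Draw real tris"
        if ( vis.renderedLastTime )
            DrawRealTris( obj );
    }

On a camera change or cut, the app just treats everything as rendered for a
frame or two and carries on - no extra hardware involved anywhere.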
> > The problem is that the if() is being done at the very start of the
> > pipeline (i.e. the AGP bus - any later and you lose most of the gain),

> Nope.. you gain fill rate by reduced depth complexity. You could gain
> effective polygon bandwidth as well.

No, you don't. Consider the case where the object is invisible:

You: draw & test bounding object. Don't draw real object.
Me: test previous frame's object results. Draw this frame's low-rez object.

Pretty even - we both draw an object that is roughly the right number of
pixels on-screen. Both get fast Z-rejects (by whatever hierarchical Z method
you like) and fetch no texels.

OK, now a drawn object:

You: draw & test bounding object. Draw real object.
Me: test previous frame's object results. Draw this frame's low-rez object.

Looks like I win. I drew one object; you drew an object & tested a bounding
object. True, you didn't need to fetch texels or write out results for your
bounding object, but you still rasterised & checked _something_. I didn't.
Plus, you also had to send down the polygon data for your bounding object,
while I didn't. It's usually small for a bounding object, but it's not zero.

[snip stuff that also isn't right, but...]

> You misunderstood.. I never said anything about drawing anything. Just a
> bounding volume hint, which is a very different thing. There's plenty of
> existing work for converting an OBB to exact screen regions *very* quickly
> without resorting to scan conversion/rasterization. We're only interested
> in conservative values as well, since it's common for a character model to
> be completely separate from a set of wall polygons.

OK, if you did this sort of incredibly conservative test (i.e. add hardware
to T&L the OBB in some quick but conservative way, find the enclosing screen
BB, test all Z values using some sort of quickie rectangle rasteriser,
somehow dodging the bullet of concurrent Z-buffer access with polys that are
currently being rasterised), maybe it would work sometimes. But remember
that you're finding the screen BB of an OBB, so the area being tested is
quite big compared to your original shape. And that's still a decent chunk
of hardware. I _still_ don't see what is so bad about adding zero hardware
to existing chips and using some frame-to-frame coherency.
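For reference, the "OBB to conservative screen region" step would look
something like this done on the CPU - purely an illustration of the idea,
not anybody's hardware. Vec3, ProjectToScreen and the per-tile max-Z buffer
are assumptions of the sketch, and it assumes larger Z = farther:

    #include <cfloat>
    #include <algorithm>

    struct Vec3 { float x, y, z; };

    // Assumed helper: view + projection + viewport transform, ignoring
    // near-plane clipping for brevity.
    Vec3 ProjectToScreen( const Vec3& worldPos );

    // Conservative test of an OBB against a low-resolution "max Z per tile"
    // buffer. Returns true if the object might be visible.
    bool ObbMightBeVisible( const Vec3 corners[8],   // world-space OBB corners
                            const float* tileMaxZ,   // farthest Z in each tile
                            int tilesX, int tilesY, int tileSize )
    {
        float minX = FLT_MAX, minY = FLT_MAX, minZ = FLT_MAX;
        float maxX = -FLT_MAX, maxY = -FLT_MAX;

        for ( int i = 0; i < 8; ++i )
        {
            Vec3 p = ProjectToScreen( corners[i] );
            minX = std::min( minX, p.x );  maxX = std::max( maxX, p.x );
            minY = std::min( minY, p.y );  maxY = std::max( maxY, p.y );
            minZ = std::min( minZ, p.z );  // nearest depth = conservative
        }

        // Screen-axis-aligned rect of the OBB - already larger than the
        // object itself, which is the "area being tested" point above.
        int tx0 = std::max( 0, (int)minX / tileSize );
        int ty0 = std::max( 0, (int)minY / tileSize );
        int tx1 = std::min( tilesX - 1, (int)maxX / tileSize );
        int ty1 = std::min( tilesY - 1, (int)maxY / tileSize );

        for ( int ty = ty0; ty <= ty1; ++ty )
            for ( int tx = tx0; tx <= tx1; ++tx )
                if ( minZ <= tileMaxZ[ty * tilesX + tx] )
                    return true;   // some covered tile is far enough to show it
        return false;              // every covered tile is nearer: reject
    }

Even reduced to a rectangle test like this, something still has to produce
and feed back that per-tile max-Z data, which is the chunk of hardware I'm
objecting to.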
> > (2) No extra triangles needed. OK, the bounding box is a pretty small
> > number of tris, but what if you wanted to do this scheme with lots of
> > smallish objects? Might get significant then.

> Again, it's a hint, not a set of triangles.. no added triangles.

> > (3) (and this is the biggie) It is already supported by tons of
> > existing, normal, shipped, out-there hardware. Not some mystical future
> > device. Real, existing ones that you have probably used.

> *very* true, very good point.

Indeedy.

[snip]

> > Huge deal if the delay is longer than a few tens of clock cycles. The
> > AGP FIFOs are not very big, and bubbles of the sort of size you are
> > talking about are not going to be absorbed by them. So for part of your
> > frame, the AGP bus will be sitting idle. And if, as is happening, you
> > are limited by AGP speed, that is going to hurt quite a lot.

> As long as the gap in the pipeline as it starts fetching the next stream
> after a rejection is shorter than the number of cycles it would have taken
> to finish the rejected stream, you win. Considering that a cycle = 4
> rasterized pixels or so, and that triangles typically are 4-8x that, and
> that arrays are typically 10 tris or more, I think it's not too much of a
> worry. Unless it really does take 200 cycles of a ~150MHz part to set
> up/redirect the DMA.

You're confusing _throughput_ with _latency_. The typical on-screen textured
pixel may take thousands of clock cycles from its triangle being read in by
the AGP bus to actually being written onto the screen. However, the next one
will be right behind it. So the throughput of chips is massive, but the
latency is terrible. What you are relying on is a short latency. And in the
graphics-chip world, latency is very, very expendable. Huge pipelines,
massively parallel, a quarter of the chip is FIFOs, multiple stages in even
the simplest operations - that is what makes graphics chips fast. Do not
stall the pipe, or you're toast. Those are the keys. This technique blows
all that out of the water.

[snip]

> > What's wrong with frame-to-frame coherence? Remember, if there is a
> > camera change or cut, the application can simply discard all the
> > visibility info it has and just draw everything, until it has vis
> > information for the new camera position.

> A couple things.. originally I didn't think this was a big deal, but later
> changed... I think making assumptions is bad, and I definitely think that
> consistent framerate is more important than a high instantaneous one.
> Nothing's more annoying than jumping through a portal in a game and
> getting dropped frames for a few frames before it gets everything sorted
> out and cached and gets back up to 60fps (or whatever the target is).
> Making the granularity of rejection sub-frame should help avoid this...
> Also, when you're using an in-engine cinematic approach, it's really
> annoying when you get a dropped frame every time the camera cuts.

This is highly app-specific though - the app can happily modify its
interpretation of the results based on the above. Whereas if you leave it
all up to the hardware, it can get very hard to get consistent framerates.
Your method actually _removes_ control from the app. That is not going to
help to get consistent, smooth framerates - if anything, it will give you
the opposite.

> > No no no. There is no way you could get the _hardware_ to reject state
> > change info and texture downloads because of some internally fed-back
> > state. Drivers rely very heavily on persistent state, i.e. not having to
> > send state that doesn't change. If the driver now doesn't know whether
> > the state change info it sent actually made it to the chip's registers
> > or not, that's just madness - the driver will go potty trying to figure
> > out what it does and doesn't need to send over the bus. Ditto for
> > downloading textures. Since the driver can't know whether the hardware
> > is going to reject the tris or not, it always has to do the texture
> > downloads anyway. And if the hardware is doing the downloads instead
> > (e.g. AGP or cached-AGP textures), then either solution works - the
> > fast-Z-rejection of pixels means those textures never get fetched.

> Ouch.. hadn't thought much about the driver-related issues. However, *if*
> state was constant across a primitive, it's not a problem. That would be a
> big issue, but I don't think it's insurmountable.

Except that was one of your supposed "plus" points - that state wouldn't
have to be changed if the object was rejected!
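To see why the driver issue bites, this is roughly the lazy state-change
filtering every driver (and plenty of engines) does - a throwaway sketch
with made-up names, not any particular driver:

    #include <cstdint>

    const int NUM_STATE_REGS = 256;           // made-up register count

    // Assumed helper: queues a register write into the command FIFO.
    void PushToCommandFifo( uint32_t reg, uint32_t value );

    // Lazy state filtering: only push a write when the value changes.
    // Correct only if every write we *do* push is guaranteed to reach the
    // chip. Assumes shadow[] is made valid by a full state dump at startup.
    struct StateCache
    {
        uint32_t shadow[NUM_STATE_REGS];      // what we believe the chip holds

        void Set( uint32_t reg, uint32_t value )
        {
            if ( shadow[reg] == value )
                return;                       // chip already has it - skip
            PushToCommandFifo( reg, value );
            shadow[reg] = value;
        }
    };

    // If the hardware could silently discard the writes attached to a
    // rejected object, shadow[] and the real registers drift apart and the
    // filtering above becomes wrong - the "driver goes potty" case.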
> So, maybe I'm foggy on some details.. but I still think early rejection in
> rasterization pipes is a *good thing*tm :).

(1) There _is_ early rejection in rasterisation pipes. Hierarchical Z is
massively cool, but relatively conventional.

(2) It's not the rasteriser that needs speeding up. We have some awesomely
fast rasterisers at the moment. But the T&L is a bottleneck under some
situations (complex lighting, high tessellation), and the AGP bus is the
bottleneck under others. Those are the things that need conserving right
now.

Tom Forsyth - Muckyfoot bloke.
Whizzing and pasting and pooting through the day.