RE: [Algorithms] PVA hierarchies (was: Terrain performance etc)

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

> From: Mark Duchaineau
> 
> Hi Tom,
> 
> Erhm, this really needs a whole paper to explain (which I'm working
> on), but here's an outline of the ideas...

I'll try to keep up :-)

> I should first note that what I'm about to describe is for the
> "wacky general geometry" case...terrain with triangle
> bintrees is much simpler.

Ah, explosion isosurfaces. Yep, wacky alright. I don't think us game
guys have nearly that sort of nastiness. Hmmm... gives me an idea for a
game :-)

> The progressive vertex array implementation I'm putting together is
> based on many of the ideas you and Charles Bloom and others on this
> list have tossed around about VIPMs with sliding windows, dealing
> with vertex cache coherence, and dealing with seams nicely when
> chunking.  My first implementation assumes a completely static set
> of geometry and takes a long time to make really coherenct and high
> quality progressions into a single vertex+index array per chunk.
> I use queue-based edge collapse ala memoryless simplification
> (Lindstrom's work) but allowing edge collapses that create
> non-manifold geometry (in effect allowing arbitrary amounts of
> collapse, not limited by the genus of the surface).

Daring! But then you can't simply smack your artists around until they
clean up the geometry.

> Collapses are organized into some kind of spatial 3D (volumetric)
> hierarchy (specifically the 3D extension of ROAM "diamonds"),
> where the collapses can affect only the triangles/vertices/edges
> on the interior of the 3D cells.  To avoid problems like in 
> Hoppe's vis98 terrain paper, which keeps the mesh on cell
> boundaries at fullest resolution,  the cells don't nest like
> octrees but rather alternate as with ROAM diamonds.  In 3D there
> are three phases of diamond cells, so you get collapses for any
> edge in at least two out of three phases of collapses, thus you
> end up with no edges getting "stuck" at full resolution.  For 2D
> cell hierarchies that are not too deep (e.g. Hoppe's terrain
> examples), keeping edges at full resolution sort-of works okay.
> But in 3D for general geometry you are hosed (I tried it...).
> The alternating boundaries are critical.

Now this IS a cool insight. This could work just as well on 2D grids,
and avoids all the mucking around with flanges and fins. I still like
the "no crosstalk" purity of flanges/fins which makes each chunk
completely isolated, but in practice I don't think it's actually that
big a win as long as your chunks are sensibly large, and doesn't create
the sort of topological constraints that flanges/fins have.

> Within each 3D diamond you get a progression that is put
> into a single vertex+index array (not stripped, but
> very vertex-cache coherent).

Strips are overrated anyway. Modern hardware does indexed lists just as
well. I use whichever uses the fewest indices, since a lot of hardware
has to have indices copied over by the CPU, and even if they are handled
natively, it's still often got to go over the AGP bus. If it's at all a
hassle to use strips, I don't - life is too short and branches too easy
to mispredict. :-)

> There is a kind of priority
> queue to order the triangles in cache coherent order,
> while at the same time re-ordering the progression a
> bit (in essence clustering some of the collapses into
> a single neighborhood collapse involving a user-specified
> maximum number of triangles, and also walking these "clusters"
> in vertex cache coherent order).  The collapse dependencies
> are used to figure out the clustering.  Lots more details
> here.

I can guess some of them. Using sliding window places restrictions on
what you can collapse when while avoiding switching buffers completely -
I can see that adding a vertex-coherency measure in there could tweak
the order even more. Though I'm getting 140Mtris/sec on my R9700 here,
so I'm not displeased with the results so far.

[snip]
> I generally have thousands to tens of thousands of vertices
> per 3D diamond at the moment (haven't really tuned this yet).

Feels about right.

> Vertex counts are reduced by a modest factor at each
> phase (dividing the count by the cube root of four is a good
> theoretical choice for smooth geometry).

I get about this factor when doing manifold collapses - it's a decent
balance between doing out-of-optimal order collapses and having to
switch index sub-buffers. One slightly worrying thing is that my index
buffers are now about the size of my vertex buffers for simple vertices,
but if you start to put any sort of tangent-space information in the
vertices it's back up to little-and-large levels again, so I don't think
it's too significant.

> On the macro scale, ROAM split-merge is performed on the
> diamonds.  Frustum culling is done ala the ROAM paper
> but only on the octree cell phases, where you have
> proper nesting.  The interesting details here are how
> to quickly get the slider (progression window start/end
> indices) in the split-merge process quickly.

I just use a big array - it's only a 64-bit entry: [32-bit start of
indices (my index buffers get big!), 16-bit vertex count, 16-bit index
count] for each level. It's a cache miss, but doing a lot of maths is a
branch predict miss. There's life in the old LUT yet :-)

> I use OpenGL ATI object buffer extensions under linux
> on the radeon 9700 pro.  For a huge computational
> fluid dynamics isosurface representing the mixing boundary
> of two gases undergoing violent (explosion-induced)
> turbulence, i.e. a REALLY nasty surface, I am getting
> a solid 100M tri/sec with essentially no CPU work
> per frame (from the app anyway -- I can't speak for
> the drivers).

The drivers on the 9700 should be happily throwing commands straight at
the card with very little work unless you're changing other pipeline
state. Other cards will need to copy the indices from your buffer to the
card's command buffer, so there will be some work there, but still not
too heavy - really just a memcpy to AGP memory.

> This surface has depth complexity 50
> on average, so I really need to minimize the window size
> to get this performance (we haven't gotten the
> occlusion culling for this finished yet -- that's
> another big project!).
> 
> My goal is to get this written up by the end of the year
> with a full demo.  The code is largely all there,
> enough to get numbers and show early demos.

Cool. Many thanks for the info. I like that alternating-pattern idea.

> Cheers,
> 
> --Mark D.