Re: [Algorithms] VIPM With T&L - what about roam
From: Charles B. <cb...@cb...> - 2000-09-14 19:42:59
Mark, you have the basic picture right; Tom can clarify the issues with the AGP bus better than I can.

First, card caches are getting pretty big; GeForce 2 and Radeon have vertex caches of about 16 entries, and future cards will have more. This means you can send 18 triangles with only 12 verts (a 4x4 grid of verts, with one row of the grid already in cache, makes a 3x3 grid of triangle pairs). In practice, you can get about one triangle per vert.

As for optimizing for the cache, Tom and I have talked about this a bit before; the basic technique is to swap out to "prefixes" of optimized indices. This incurs no CPU work, it's just a problem of storing all these prefixes; however, they can be shared, so if you have a lot of model duplication (as we do in games) the waste can be reclaimed. The result is that you can keep all but a small portion of your index chunks cache-optimal at all times.

>2) In the simple VIPM scheme, just send three indices
>per tri, and verts and tris are listed in the arrays in
>progressive split order, the indices need to be updated
>for triangles whose vertices moved in any splits
>introduced in the frame, which you precompute as a kind
>of table lookup of state-change lists. These changes
>will generally be scattered through index memory,
>leading to bad PC-memory cache behavior. But you
>hope that only a tiny fraction of the indices need
>to be updated per frame (this is a very ROAM-like
>hope ;-).

This is actually not a problem for the CPU, because you never, ever read from the index list. That means all you ever do is writes; you can just fire them off, they get retired asynchronously, and the CPU never stalls. If you want to get fancy you can even tell the CPU not to mirror that memory in cache, and just write straight to main memory. The wonderful Athlon chip can have 32 outstanding queued stores, so this is no problemo. A problem would only arise if you wrote to and read from the same data structure, which we never do. Note that you must, however, wait a while after making your index changes before rendering, or your AGP DMA will stall on the CPU stores finishing. If this were the bottleneck, we'd be golden.

Let me try to check your bandwidth numbers. Let's imagine that I can make great strips, so I just send one index per triangle, and I'm using the cache well so I just send one vert per triangle, and I'm aiming at 20 MtriHz. First of all, my vertices are 32 bytes (3*4 for the position, 2*4 for the UVs (they're floats), and 3*4 for the normal, cuz I want the card to do my lighting for me), so I'm sending 34 bytes per triangle if you count a 2-byte index; at 20 Mtris/sec that's about 680 MB/sec. OK, now let's imagine I'm not using the cache perfectly, and I'm using indexed lists, so I need 6 bytes for indices and I'm sending 1.5 verts per tri = 54 bytes/tri; so I need just about 1 GB/sec. This is a bit optimistic to expect from your AGP, but if you cut it in half, you're still getting 10 MtriHz. The point is that the VIPM process did not hurt your triangle rate in the least; any triangle rate you can get without VIPM, you can get with VIPM too. That wasn't true until Tom had the idea of swapping optimized prefixes of indices, but if you combine that with VIPM stripping, it's true for any card.

>So...does it make sense
>to put some geometry info into graphics-card mem?

Probably. Cards just don't/won't do it. Anyway, X-Box will solve this neatly with the unified memory architecture: you can do your VIPM in main memory, and the card will just tear your data from there at breakneck speeds.
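To make the write-only index update concrete, here's a rough sketch of the kind of thing I mean. The names and layout are made up for illustration, not anything from Tom's or my actual code:

#include <cstdint>
#include <vector>

// One index slot that has to change when a given vsplit is applied.
struct IndexPatch
{
    uint32_t indexOffset;   // offset into the chunk's index list
    uint16_t newVert;       // vertex index to store there
};

// Precomputed per vsplit, stored in split order for the chunk.
struct VSplit
{
    std::vector<IndexPatch> patches;
};

// Walk the chunk from LOD curSplit up to targetSplit by firing stores into
// the index list.  We only ever *write* this memory (it can even be
// write-combined / uncached), so the CPU never stalls on it.  Growing the
// active index count to cover the triangles each split appends is left out.
void ApplySplits(uint16_t* indices, const std::vector<VSplit>& splits,
                 int curSplit, int targetSplit)
{
    for (int s = curSplit; s < targetSplit; ++s)
        for (const IndexPatch& p : splits[s].patches)
            indices[p.indexOffset] = p.newVert;
}

Going the other way (collapses) is the same kind of loop run over the ecoll patch lists; either way it's pure stores, never a read.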
>If you really want to minimize the number of times
>verts are sent, you need to allow much more
>index manipulation per frame to optimize, whether
>via precomputed state-change lists or through
>some yet unknown on-the-fly technique.

See above (and search the archives) about switching index prefixes.

>3) In the "stripped" version of VIPM, cover the chunk
>with strips (chosen in a particular way?) and fiddle with
>the indices just as in case (2).

First, any card that can do 20 MTris/sec won't blink at getting rid of the degenerate triangles. Stripped VIPM only really works with the swap-out technique. You start with a strip for the model. Assuming perfect stripping for the moment, you've cut your index cost to 1/3. Note that the update process is identical if I arrange my "vsplit/ecoll" structure a bit: instead of storing a triangle+vert to change, I store an offset in the index list to change. Then the vsplit/ecoll doesn't care whether it's strips or lists being modified. Ideally, with strips you'd change fewer indices with each vsplit/ecoll, because many indices are shared. Once you've reduced down to some fraction F of the triangles, you switch to a new optimal stripping (which has removed the degenerate triangles and perhaps re-ordered the walk for better caching). In the worst case, you are storing N indices when you could store N*3*F indices. You're also sending N triangles when you could send N*F triangles. I think F = 2/3 is a good number to switch at. Really, with list swapping the stripping and listing methods are very similar. It's just a matter of: A. lists are more flexible and friendly, B. strips are a bit more compact, C. strips are probably faster on current cards, lists are probably faster in the future.

>Of course, you could use the incremental stripping
>idea from the ROAM paper, which works on any
>locally updating mesh including PM. Since you
>are clearly hoping for coherence in the index-update
>step, this is a cheap way to make pretty good strips.
>Plus it avoids the issue of losing vertex-cache
>coherence for any but the one mesh you optimized for.

Hmm; the ROAM paper doesn't go into great detail about this; maybe you could expound? First, let me make clear what I think of as a "good strip": the whole mesh should be in one strip. You can't afford to issue separate draw commands for 3-7 vertex strips. I find a good stripping for caching, then just tack the strips together with degenerate vertices (duplicated indices) to make one big strip. That way I just fire a single render. The problem I have with all the VDPM strippers is that they can't work in this way, and/or might incur large non-O(N) costs if they try to. For example, consider the "skip strips" of El-Sana et al. You could keep one big strip and just change indices; that would work fine and be a pretty fast way to do VDPM. The problem is that you could never reduce the triangle count. In VIPM I can reduce the triangle count by just swapping out my strip at a specific LOD, because I know exactly what the mesh is like at that LOD in advance. With VDPM I can't know that, so I can never swap out to a pre-computed simpler version. So how else can I reduce triangle count? Well, what El-Sana does is run through the strip every frame and eliminate degenerate triangles. Obviously, that's unacceptable because it requires too much shuffling in memory when you cut a triangle. To avoid that, you could cut your strip into chunks.
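As an aside, since I keep talking about gluing strips together with duplicated indices, here's roughly what I mean. This is just a sketch with made-up names, assuming the usual convention that a strip's winding flips on odd triangle positions:

#include <cstdint>
#include <vector>

// Glue several cache-friendly strips into one big strip by duplicating
// indices at the seams.  The duplicates only make degenerate triangles,
// which the card rejects cheaply, so the whole mesh still goes out in a
// single draw call.
std::vector<uint16_t> JoinStrips(const std::vector< std::vector<uint16_t> >& strips)
{
    std::vector<uint16_t> out;
    for (const std::vector<uint16_t>& strip : strips)
    {
        if (strip.empty())
            continue;
        if (!out.empty())
        {
            if (out.size() & 1)              // pad so the next strip's first real
                out.push_back(out.back());   // triangle lands on an even position
                                             // (keeps its winding)
            out.push_back(out.back());       // repeat last index of previous strip
            out.push_back(strip.front());    // repeat first index of this strip
        }
        out.insert(out.end(), strip.begin(), strip.end());
    }
    return out;
}

The duplicates cost two or three extra indices per seam, which is nothing next to paying a separate draw call per little strip.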
Back to cutting the strip into chunks: your hope is that VDPM just changes triangle count in a few of these chunks, which are pretty short, so you don't have to do much shuffling to cut the degeneracies. Now if you want to take this and achieve optimality again, you'd have to merge neighboring chunks whenever they get too small, and split them when they get too big. The problem is this: if your mesh has N triangles, and you keep a bunch of little strips of target size M, then your CPU cost to fire the strips at the graphics card is O(N/M). However, your CPU cost to update a strip is O(M) (that's optimistically assuming that you only change O(1) strips per frame). To jointly minimize the two, you must have strips of average length M = O(sqrt(N)), and the result is that your LOD algorithm is O(sqrt(N)), where N is the number of triangles visible.
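To spell out that last minimization (my notation; a and b aren't numbers from anywhere above, just the per-draw-call overhead and the per-index shuffle cost):

  cost(M) = a*(N/M) + b*M
  d(cost)/dM = -a*N/M^2 + b = 0   =>   M = sqrt(a*N/b) = O(sqrt(N))
  cost at that M = 2*sqrt(a*b*N) = O(sqrt(N))

So even in the best case the per-frame CPU work grows like the square root of the visible triangle count, which is the point.

--------------------------------------
Charles Bloom   www.cbloom.com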