Re: [Algorithms] VIPM With T&L - what about roam
From: Charles B. <cb...@cb...> - 2000-09-14 19:42:59
Mark, you have the basic picture right; Tom can clarify the issues with the AGP bus better than I can.

First, card caches are getting pretty big; GeForce 2 and Radeon have vertex caches of about 16 entries, and future cards will have more. This means you can send 18 triangles with only 12 verts (a 4x4 grid of verts, with one row of the grid already in cache, makes a 3x3 grid of triangle pairs). In practice, you can get about one triangle per vert.

As for optimizing for the cache, Tom and I have talked about this a bit before; the basic technique is to swap out to "prefixes" of optimized indices. This incurs no CPU work, it's just a problem of storing all these prefixes; however, they can be shared, so if you have a lot of model duplication (as we do in games) the waste can be reclaimed. The result is that you can keep all but a small portion of your index chunks cache-optimal at all times.

>2) In the simple VIPM scheme, just send three indices
>per tri, and verts and tris are listed in the arrays in
>progressive split order, the indices need to be updated
>for triangles whose vertices moved in any splits
>introduced in the frame, which you precompute as a kind
>of table lookup of state-change lists. These changes
>will generally be scattered through index memory,
>leading to bad PC-memory cache behavior. But you
>hope that only a tiny fraction of the indices need
>to be updated per frame (this is a very ROAM-like
>hope ;-).

This is actually not a problem for the CPU, because you never, ever read from the index list. That means all you ever do is writes; you can just fire them off, they get retired asynchronously, and the CPU never stalls. If you want to get fancy you can even tell the CPU not to mirror that memory in cache, and just write straight to main memory. The wonderful Athlon chip can have 32 outstanding queued stores, so this is no problemo. A problem would only arise if you wrote to and read from the same data structure, which we never do. Note that you must, however, wait a while after making your index changes before rendering, or your AGP DMA will stall on the CPU stores finishing. If this were the bottleneck, we'd be golden.

Let me try to check your bandwidth numbers. Let's imagine that I can make great strips, so I just send one index per triangle, and I'm using the cache well so I just send one vert per triangle, and I'm aiming at 20 MtriHz. First of all, my vertices are 32 bytes (3*4 for the position, 2*4 for the UVs (they're floats), and 3*4 for the normal, cuz I want the card to do my lighting for me), so I'm sending 34 bytes per triangle if you count a 2-byte index; at 20 Mtris/sec that's about 680 MB/sec. OK, now let's imagine I'm not using the cache perfectly, and I'm using indexed lists, so I need 6 bytes for indices and I'm sending 1.5 verts per tri = 54 bytes/tri; so I need just about 1 GB/sec. This is a bit optimistic to expect from your AGP, but if you cut it in half, you're still getting 10 MtriHz. The point is that the VIPM process did not hurt your triangle rate in the least; any triangle rate you can get without VIPM, you can get with VIPM too. That wasn't true until Tom had the idea of swapping optimized prefixes of indices, but if you combine that with VIPM stripping, it's true for any card.

>So...does it make sense
>to put some geometry info into graphics-card mem?

Probably. Cards just don't/won't do it. Anyway, X-Box will solve this neatly with the unified memory architecture: you can do your VIPM in main memory, and the card will just tear your data from there at breakneck speeds.
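To make the write-only index update concrete, here's a rough sketch of the kind of thing I mean. The names and layout are made up for illustration, not anything from Tom's or my actual code:

#include <cstdint>
#include <vector>

// One index slot that has to change when a given vsplit is applied.
struct IndexPatch
{
    uint32_t indexOffset;   // offset into the chunk's index list
    uint16_t newVert;       // vertex index to store there
};

// Precomputed per vsplit, stored in split order for the chunk.
struct VSplit
{
    std::vector<IndexPatch> patches;
};

// Walk the chunk from LOD curSplit up to targetSplit by firing stores into
// the index list.  We only ever *write* this memory (it can even be
// write-combined / uncached), so the CPU never stalls on it.  Growing the
// active index count to cover the triangles each split appends is left out.
void ApplySplits(uint16_t* indices, const std::vector<VSplit>& splits,
                 int curSplit, int targetSplit)
{
    for (int s = curSplit; s < targetSplit; ++s)
        for (const IndexPatch& p : splits[s].patches)
            indices[p.indexOffset] = p.newVert;
}

Going the other way (collapses) is the same kind of loop run over the ecoll patch lists; either way it's pure stores, never a read.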
>If you really want to minimize the number of times
>verts are sent, you need to allow much more
>index manipulation per frame to optimize, whether
>via precomputed state-change lists or through
>some yet unknown on-the-fly technique.

See above (and search the archives) about switching index prefixes.

>3) In the "stripped" version of VIPM, cover the chunk
>with strips (chosen in a particular way?) and fiddle with
>the indices just as in case (2).

First, any card that can do 20 MTris/sec won't blink at getting rid of the degenerate triangles. Stripped VIPM only really works with the swap-out technique. You start with a strip for the model. Assuming perfect stripping for the moment, you've cut your index cost to 1/3. Note that the update process is identical if I arrange my "vsplit/ecoll" structure a bit: instead of storing a triangle+vert to change, I store an offset in the index list to change. Then the vsplit/ecoll doesn't care whether it's strips or lists being modified. Ideally, with strips you'd change fewer indices with each vsplit/ecoll, because many indices are shared. Once you've reduced down to some fraction F of the triangles, you switch to a new optimal stripping (which has removed the degenerate triangles and perhaps re-ordered the walk for better caching). In the worst case, you are storing N indices when you could store N*3*F indices. You're also sending N triangles when you could send N*F triangles. I think F = 2/3 is a good number to switch at. Really, with list swapping the stripping and listing methods are very similar. It's just a matter of: A. lists are more flexible and friendly, B. strips are a bit more compact, C. strips are probably faster on current cards, lists are probably faster in the future.

>Of course, you could use the incremental stripping
>idea from the ROAM paper, which works on any
>locally updating mesh including PM. Since you
>are clearly hoping for coherence in the index-update
>step, this is a cheap way to make pretty good strips.
>Plus it avoids the issue of losing vertex-cache
>coherence for any but the one mesh you optimized for.

Hmm; the ROAM paper doesn't go into great detail about this; maybe you could expound? First, let me make clear what I think of as a "good strip": the whole mesh should be in one strip. You can't afford to issue separate draw commands for 3-7 vertex strips. I find a good stripping for caching, then just tack the strips together with degenerate vertices (duplicated indices) to make one big strip. That way I just fire a single render. The problem I have with all the VDPM strippers is that they can't work in this way, and/or might incur large non-O(N) costs if they try to. For example, consider the "skip strips" of El-Sana et al. You could keep one big strip and just change indices; that would work fine and be a pretty fast way to do VDPM. The problem is that you could never reduce the triangle count. In VIPM I can reduce the triangle count by just swapping out my strip at a specific LOD, because I know exactly what the mesh is like at that LOD in advance. With VDPM I can't know that, so I can never swap out to a pre-computed simpler version. So how else can I reduce triangle count? Well, what El-Sana does is run through the strip every frame and eliminate degenerate triangles. Obviously, that's unacceptable because it requires too much shuffling in memory when you cut a triangle. To avoid that, you could cut your strip into chunks.
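As an aside, since I keep talking about gluing strips together with duplicated indices, here's roughly what I mean. This is just a sketch with made-up names, assuming the usual convention that a strip's winding flips on odd triangle positions:

#include <cstdint>
#include <vector>

// Glue several cache-friendly strips into one big strip by duplicating
// indices at the seams.  The duplicates only make degenerate triangles,
// which the card rejects cheaply, so the whole mesh still goes out in a
// single draw call.
std::vector<uint16_t> JoinStrips(const std::vector< std::vector<uint16_t> >& strips)
{
    std::vector<uint16_t> out;
    for (const std::vector<uint16_t>& strip : strips)
    {
        if (strip.empty())
            continue;
        if (!out.empty())
        {
            if (out.size() & 1)              // pad so the next strip's first real
                out.push_back(out.back());   // triangle lands on an even position
                                             // (keeps its winding)
            out.push_back(out.back());       // repeat last index of previous strip
            out.push_back(strip.front());    // repeat first index of this strip
        }
        out.insert(out.end(), strip.begin(), strip.end());
    }
    return out;
}

The duplicates cost two or three extra indices per seam, which is nothing next to paying a separate draw call per little strip.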
Back to cutting the strip into chunks: your hope is that VDPM just changes triangle count in a few of these chunks, which are pretty short, so you don't have to do much shuffling to cut the degeneracies. Now if you want to take this and achieve optimality again, you'd have to merge neighboring chunks whenever they get too small, and split them when they get too big. The problem is this: if your mesh has N triangles, and you keep a bunch of little strips of target size M, then your CPU cost to fire the strips at the graphics card is O(N/M). However, your CPU cost to update a strip is O(M) (that's optimistically assuming that you only change O(1) strips per frame). To jointly minimize the two, you must have strips of average length M = O(sqrt(N)), and the result is that your LOD algorithm is O(sqrt(N)), where N is the number of triangles visible.
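To spell out that last minimization (my notation; a and b aren't numbers from anywhere above, just the per-draw-call overhead and the per-index shuffle cost):

  cost(M) = a*(N/M) + b*M
  d(cost)/dM = -a*N/M^2 + b = 0   =>   M = sqrt(a*N/b) = O(sqrt(N))
  cost at that M = 2*sqrt(a*b*N) = O(sqrt(N))

So even in the best case the per-frame CPU work grows like the square root of the visible triangle count, which is the point.

--------------------------------------
Charles Bloom   www.cbloom.com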