Thread: Re: [Algorithms] VIPM With T&L - what about roam
From: Daniel R. <Dan...@ho...> - 2000-09-14 01:11:52
hi charles,

> I think some kind of chunked ROAM or VDPM is perhaps the future; it would
> be nice to find an algorithm which can be seamlessly blended from VIPM
> (large chunks) to ROAM (small chunks) so that a parameter can be tweaked
> which controls the CPU vs GPU loading. The brilliant step in this
> direction is still missing; all attempts at it are way too bulky to
> compete with VIPM or the minimalist versions of ROAM (ala Treadmarks).
> --------------------------------------
> Charles Bloom     www.cbloom.com

could you be so kind as to point me to some sources (or simply explain it
yourself) on what the (late) coder of Treadmarks did? (or is your Treadmarks
something different to the game called Threadmarks i remember ...)

thx, daniel aka re...@ma...a aka SirLeto
From: Tom F. <to...@mu...> - 2000-09-13 10:37:02
Yes, for things that like strips (though I'm still hopeful that I can
persuade this bit of hardware to do something other than strips, maybe by
switching to fans half way through or something funky - it's a pain because
the rasteriser is basically just a fast Permedia2, but it doesn't have the
nice exposed front-end interface of the P2), the original skipstrips is
probably good. Though there's no reason you can't use indexed strips to get
better DMA throughput - I have a cunning scheme to get pretty good indexing
working on this hardware, even though it can't do _real_ demand-loaded
caching itself.

Using something like indexed skipstrips and power-of-two precalculated
models seems like it should be a good thing to try.

Tom Forsyth - Muckyfoot bloke.
Whizzing and pasting and pooting through the day.

> -----Original Message-----
> From: Brian Marshall [mailto:bma...@ra...]
> Sent: 13 September 2000 07:45
> To: gda...@li...
> Subject: RE: [Algorithms] VIPM With T&L - what about roam
>
> I've been thinking about this more. The requirements for the hardware I'm
> working with are somewhat different - I get maximum benefit from very long
> strips, since the hardware has a very different caching scheme to GeForce
> etc.
>
> I'm currently edging towards trying a combination of skip strips and your
> suggestion Tom, of precomputed strips for various levels. The idea would
> be to precompute good sets of strips at reasonable values, and use skip
> strips to maintain the strips during edge collapses.
>
> During PM computation you start by doing a full stripification of the
> model. Then as you do each collapse you rate the current set of strips -
> how many degeneracies, how long, etc. When the strip quality reaches a
> user-set threshold you regenerate the strips. That way the user can set
> the trade-off between memory usage (for extra sets of strips) and
> rendering efficiency. For an ultimate version you could recompute a new
> set of strips for each edge collapse to compare the efficiency of the
> current skip-strips against. Then you can decide exactly when to issue new
> strips. Not something to do while the art team are working on the model,
> but it should work great overnight.
>
> There are a couple of problems I can see:
> 1. The edge collapse records get more complex - you need to issue sets of
> strips as well.
> 2. The runtime memory state required is somewhat larger - you may need 2
> sets of strips (the second having excess degenerate tris removed for
> efficiency), as well as more complex collapse records and extra data for
> the sets of strips to restart things.
>
> -Brian.
>
> > Cool. I'll have a look at the second one of those. The first one
> > (skipstrips) was discussed here a while back, and led to me burbling
> > this, which bears repeating for those who recently joined us (especially
> > as someone asked for it):
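A minimal offline sketch of the strip-rating scheme Brian describes might
look like this (C++; every type and helper here is hypothetical, invented
for illustration, not code from this thread):

#include <vector>

struct Mesh;                                   // your mesh representation
struct EdgeCollapse { int from, to; };         // one precomputed collapse

struct StripSet
{
    std::vector< std::vector<int> > strips;    // vertex indices per strip
    int degenerateTris;                        // tris collapsed to zero area
    int totalTris;
};

StripSet Stripify( const Mesh& );              // assumed: full restripper
void ApplyCollapse( Mesh&, StripSet&, const EdgeCollapse& );

// Crude quality metric: fraction of non-degenerate tris. A real version
// would also weight strip lengths, restarts, cache behaviour, etc.
float StripQuality( const StripSet& s )
{
    return 1.0f - float( s.degenerateTris ) / float( s.totalTris );
}

void BuildStripSnapshots( Mesh& mesh,
                          const std::vector<EdgeCollapse>& collapses,
                          float qualityThreshold,
                          std::vector<StripSet>& snapshots )
{
    StripSet current = Stripify( mesh );       // full stripification, max detail
    snapshots.push_back( current );
    for ( size_t i = 0; i < collapses.size(); ++i )
    {
        ApplyCollapse( mesh, current, collapses[i] );  // tris go degenerate
        if ( StripQuality( current ) < qualityThreshold )
        {
            current = Stripify( mesh );        // regenerate: costs memory only
            snapshots.push_back( current );
        }
    }
}

The stored snapshots then act as the precomputed "baseline" strip sets;
the runtime only ever patches indices between snapshots.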
From: Tom F. <to...@mu...> - 2000-09-13 19:56:36
> From: Mark Duchaineau [mailto:duc...@ll...]
>
> Tom,
>
> As a ROAM crusader I must respond ;-)...

But of course...

> ROAM works just fine on arbitrary geometry. We mentioned it in the ROAM
> paper (but didn't implement it). Since then I've implemented that and it
> works great using a forest of triangle bintrees.

I was actually going to add an aside that maybe ROAM would work for
arbitrary geometry, but then thought "nah, if it did, someone would have
tried it by now, so there must be some gotcha I haven't spotted".
Interesting to note that there isn't a huge gotcha.

> Cover your surface with "patches" that are 2^n + 1 verts on a side, where
> adjacent patches match vertices on their mutual edge, and you are set to
> go. The trick is to convert your geometry to "patches" like this. There is
> the MAPS algorithm from SIGGRAPH (Lee et al, 98), plus a new shrink wrap
> technique for really wild geometry like turbulent CFD isosurfaces by
> Bertram et al in Vis 00. Go to
> http://graphics.cs.ucdavis.edu/Publications-2000.html
> and look for "Bicubic Subdivision-Surface Wavelets for Large-Scale
> Isosurface Representation and Visualization."

Cool. Hoppe also has some algos for finding suitable base meshes for
reparameterisations. Actually, about half his papers use them. So it looks
like finding suitable "patches" is not rocket science (or at least, it's
well-understood rocket science). People who have read my talk on what I do
to get my high-tri models (start with low-tri models, subdiv surface &
displacement map - at http://www.muckyfoot.com/downloads/tom.shtml) will
note that since I start with a low-tri mesh, and then tessellate
(moderately) evenly, that stage is ready-done for me!

> The ROAM dual-queue idea works great on Directed Acyclic Graph (DAG) style
> hierarchies, a special case of which is VDPM. This was shown in a paper by
> Bernadini and Schikore, "Dynamic viewpoint-dependent level of detail of
> large triangle meshes", presented at The IASTED International Conference
> on Computer Graphics and Imaging, Halifax, Canada, June 1-4, 1998. An
> online preprint is here: http://www.llnl.gov/tid/lof/ (type Schikore as
> the name to search on, then click on the last item that comes up).
> Additionally, the priority deferral and incremental stripping ideas work
> fine for DAGs. The one thing that needs a small re-work is the frustum
> culling using halfplane labels and recursion. DAGs don't have the simple
> nesting relationship that triangle bintrees do.
>
> VIPM I see as okay for large collections of little objects, if you ignore
> the fact that you are roughly doubling the transform and download work for
> back-facing sections of the object (VD methods can easily eliminate those
> from ever hitting the graphics hardware). If your "little" object isn't,
> like a ship you decide to walk into, then you really need to do
> view-dependent processing on finer-grained units than whole objects.
> Further, you need Yet Another Hierarchy (YAH) to do frustum culling, and
> other macroscopic logic to decide how to balance the LODs on all the
> objects. So the simplicity of the VIPM vertex selection for a single
> object isn't buying you all that much in overall system simplicity.

Agreed, for big things you walk into, you need to split into chunks. But
then each chunk is going to be fairly big on screen (a wall, a pillar, a
computer console, etc), so the number of tris in it is going to be high when
looking at it, so the granularity of the work you are doing for VIPM is
still small (i.e. a VIPM "work unit" per couple of thousand tris).

However, note that in the case of a thing you walk into, you're going to
need some sort of visibility culling, e.g. portals or something (even boring
old frustum culling, so you don't load gazillions of textures, needs you to
split stuff into some sort of chunk that it can test), so you're still going
to need YAH of one kind or another. Also, lots of these chunks are going to
be very similar (we wish we had time to make every bit of a building unique,
but ten-year development times are frowned upon :-), and replicating them
with VIPM is (a) a doddle and (b) very efficient (you share the vertex and
split data, and for some implementations of VIPM you also share large chunks
of index data as well). So it's YAH whichever way you do it really.

There is a good case for trying to merge more distant collections of objects
into one (e.g. make a whole room a VIPM chunk, so that when viewing it from
the other end of a long corridor, it doesn't chew too much time), but I'm
sure a fairly simple hierarchy of chunks would handle this just as
effectively as ROAM. So you have the room as a chunk which is VIPM'd, and
when it gets to a certain number of tris, it splits into five different VIPM
chunks (that have the same polygon representation at that LOD), and so on,
so that when you get close enough for the view-dependent differences to be
significant, they are all separate objects as above.

This "clumping" needs to be done with some heuristic, and needs to interface
with your YAH (visibility culling or otherwise) whether you are ROAMing or
VIPMing, so that seems to be fairly equivalent work to me.

Incidentally, "VD methods can easily eliminate those from ever hitting the
graphics hardware" - replace "easily" with "at no extra effort" and I'll
agree :-) But I think the effort per object (or vertex or whatever dodgy
benchmark we use) is going to be higher for VDPM than for VIPM. But that's
the very core of our religions, so we'll leave that for now. :-)

> Correct me if I'm wrong, but in your VIPM code you still need to compute
> the actual strips or indexing on the fly each frame?

No. The whole point of VIPM is that you calculate the strips/list/whatever
offline. You then make a low number (e.g. five) of modifications to that
data each frame, according to whether an edge collapses or expands. If an
edge doesn't change, then (a) the strip/list information for it doesn't get
touched by the CPU, and (b) the edge isn't even considered by the CPU.

Basically, there is a list of collapses/expands (which it is depends which
way you traverse it). The object has a "current position" pointer, and each
frame the CPU decides how far, and in which direction, it needs to move that
pointer. When moving the pointer, the collapse/expand entries simply say
which indices need to be changed, and to what value, so it is a simple
memory-copying operation on indices to move the "current position". No
thinking is required by the CPU, apart from deciding how many tris it wants
to draw for the VIPM chunk - just memory bandwidth really.

If you're using indexed lists, the tris are also stored in collapse order,
so collapsed tris just get dropped off the end of the list. If using strips,
then tris are collapsed out of the middle of strips just by making them
degenerate (which hardware bins very quickly, since it spots that two of the
three indices are the same). Vertices are also stored in collapse order, so
unused verts just fall off the end of the list (i.e. you just tell the T&L
pipeline to T&L fewer verts).

So if the object doesn't change in LoD (which is usually just a cheesy "size
of bounding sphere over distance from viewer"-type heuristic to be honest),
then the CPU does no work at all - it just points the graphics API at the
previous frame's data. If collapses/expands are needed, then the CPU doesn't
do any thinking - it just moves (typically) 3-4 indices (WORDs) per
collapse/expand around in memory, increments or decrements the tri count by
two, increments or decrements the vertex count by one, and points the
graphics API at the data.

> You can "precompute" in a coarse-grained way only if you keep something
> like power-of-two static LODs, in which case VIPM isn't all that different
> from traditional static LODs circa 1976, except that you cut the vertex
> storage in half in some cases. If you use skip-strips or other techniques
> you end up doing work on a fine-grained level, and at that point you might
> as well put some VD smarts in.

Variants of VIPM recognise that after quite a few collapses, you're going to
have either a strip with lots of degenerate tris, or a list with bad vertex
caching, and so store several versions (e.g. power-of-two numbers of tris)
as "baselines" that then get mangled by the above, to keep the number of
degenerate/badly cached tris to a sensible limit. But that swapover rarely
takes place, and when it does it's always at the same point in the
collapse/expand order, so it doesn't cost anything but the extra storage. So
the precomputed versions are just to improve the hardware's efficiency a bit
- they do not alter the fact that (assuming the camera moves slowly enough)
you can have just two tris removed/added per frame all the way from ten to
ten million tris. The switch from one version to the next power of two is
done when the two describe identical geometries, just in a different
strip/list ordering.

> All that said, it is certainly worth exploring *somewhat* coarser-grained
> view dependence. Fine-grained view-dependent techniques using ANY
> algorithm just can't keep up with graphics hardware (unless you move
> veeeery slowly ;-). Hoppe's 1998 IEEE Vis paper is a step in the right
> direction, but has a lot of problems and only works with terrain. See:
> http://www.research.microsoft.com/~hoppe/#svdlod
>
> --Mark D.
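The runtime just described fits in a few lines of C++; here is a hedged
sketch (the record layout and all names are invented for illustration - this
is not Tom's actual code):

#include <cstdint>
#include <vector>

struct IndexPatch
{
    uint32_t indexListOffset;  // which entry of the index list to overwrite
    uint16_t newVertex;        // vertex it refers to after the collapse
};

struct CollapseRecord
{
    IndexPatch patches[4];     // typically 3-4 indices change per collapse
    int numPatches;
    uint16_t oldVertex;        // every patched index pointed here before
};

struct VipmChunk
{
    std::vector<uint16_t> indices;           // tris stored in collapse order
    std::vector<CollapseRecord> collapses;   // precomputed offline
    int currentCollapse;                     // the "current position" pointer
    int numTris;                             // what we ask the API to draw
    int numVerts;                            // verts the T&L pipe transforms
};

// Slide the current position to the LOD the heuristic asked for.
void SetLod( VipmChunk& c, int targetCollapse )
{
    while ( c.currentCollapse < targetCollapse )        // collapsing (coarser)
    {
        const CollapseRecord& r = c.collapses[ c.currentCollapse++ ];
        for ( int i = 0; i < r.numPatches; i++ )
            c.indices[ r.patches[i].indexListOffset ] = r.patches[i].newVertex;
        c.numTris  -= 2;   // two tris drop off the end of the index list
        c.numVerts -= 1;   // the collapsed vert falls off the end too
    }
    while ( c.currentCollapse > targetCollapse )        // expanding (finer)
    {
        const CollapseRecord& r = c.collapses[ --c.currentCollapse ];
        for ( int i = 0; i < r.numPatches; i++ )
            c.indices[ r.patches[i].indexListOffset ] = r.oldVertex;
        c.numTris  += 2;
        c.numVerts += 1;
    }
}

The loop bodies are pure index writes, which is the point: the per-frame
cost is memory bandwidth, not CPU thinking.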
From: Mark D. <duc...@ll...> - 2000-09-13 21:41:52
Tom Forsyth wrote:
> > From: Mark Duchaineau [mailto:duc...@ll...]
> > [snip]
> > The ROAM dual-queue idea works great on Directed Acyclic Graph (DAG)
> > style hierarchies, a special case of which is VDPM. [snip]
> >
> > VIPM I see as okay for large collections of little objects, if you
> > ignore the fact that you are roughly doubling the transform and download
> > work for back-facing sections of the object (VD methods can easily
> > eliminate those from ever hitting the graphics hardware). If your
> > "little" object isn't, like a ship you decide to walk into, then you
> > really need to do view-dependent processing on finer-grained units than
> > whole objects. Further, you need Yet Another Hierarchy (YAH) to do
> > frustum culling, and other macroscopic logic to decide how to balance
> > the LODs on all the objects. So the simplicity of the VIPM vertex
> > selection for a single object isn't buying you all that much in overall
> > system simplicity.
>
> Agreed, for big things you walk into, you need to split into chunks. But
> then each chunk is going to be fairly big on screen (a wall, a pillar, a
> computer console, etc), so the number of tris in it is going to be high
> when looking at it, so the granularity of the work you are doing for VIPM
> is still small (i.e. a VIPM "work unit" per couple of thousand tris).

We seem to agree--eek! So the questions are: (1) how big are the chunks,
(2) how are seams dealt with, (3) how do you organize VD work on these
chunks. The VIPM is slick for one subtask: sliding from coarse to fine with
essentially no compute or downloads within a chunk. This is really
orthogonal to what ROAM is all about, and can be used within a ROAM-like
system very nicely. ROAM is a good way to get seamless chunks and do the
higher-level VD LOD selections. I can see VIPM ideas working great on the
"units" that ROAM is optimizing. But I want to see something better than
"keep the seams at highest resolution"...which is what Hoppe proposed in
Vis 98.

> However, note that in the case of a thing you walk into, you're going to
> need some sort of visibility culling, e.g. portals or something (even
> boring old frustum culling, so you don't load gazillions of textures,
> needs you to split stuff into some sort of chunk that it can test), so
> you're still going to need YAH of one kind or another. Also, lots of these
> chunks are going to be very similar (we wish we had time to make every bit
> of a building unique, but ten-year development times are frowned upon :-),
> and replicating them with VIPM is (a) a doddle and (b) very efficient (you
> share the vertex and split data, and for some implementations of VIPM you
> also share large chunks of index data as well). So it's YAH whichever way
> you do it really.

YAHp, it's true. But let's keep the number of hierarchies to a minimum to
keep our sanity. I think we agree here... And yes, we need realtime
general-purpose visibility culling integrated with the VD LOD system, and
that is *hard*... Special cases like rooms are solved well for static
geometry, but it gets a little tricky with geometry dynamically getting
approximated by a VD optimizer. More general situations are just not handled
in realtime yet.

> There is a good case for trying to merge more distant collections of
> objects into one (e.g. make a whole room a VIPM chunk, so that when
> viewing it from the other end of a long corridor, it doesn't chew too much
> time), but I'm sure a fairly simple hierarchy of chunks would handle this
> just as effectively as ROAM. So you have the room as a chunk which is
> VIPM'd, and when it gets to a certain number of tris, it splits into five
> different VIPM chunks (that have the same polygon representation at that
> LOD), and so on, so that when you get close enough for the view-dependent
> differences to be significant, they are all separate objects as above.
>
> This "clumping" needs to be done with some heuristic, and needs to
> interface with your YAH (visibility culling or otherwise) whether you are
> ROAMing or VIPMing, so that seems to be fairly equivalent work to me.

How to "clump" well is tricky, or at least I haven't heard a good solution
yet. The main argument I have is that it is better to break geometry into
smaller clumps (the smaller the better from an accuracy point of view), so
that the non-uniform view-dependent scaling of errors gets made more
uniform, and thus the VIPM is locally close to what you would have gotten
with a VD method. Also, smaller clumps give better frustum culling and
backface culling and visibility culling... The great challenge is to make
the clumps as small as possible while having the optimizer keep up with the
graphics hardware.

> Incidentally, "VD methods can easily eliminate those from ever hitting the
> graphics hardware" - replace "easily" with "at no extra effort" and I'll
> agree :-) But I think the effort per object (or vertex or whatever dodgy
> benchmark we use) is going to be higher for VDPM than for VIPM. But that's
> the very core of our religions, so we'll leave that for now. :-)

You will have to fill me in on what you are doing at the macro level for me
to compare/contrast with ROAM-like approaches. ROAM does the top-level
optimization in the "theoretical optimum" time, so it would be surprising if
you could beat the accuracy versus compute time tradeoff of ROAM. VIPM
sounds great for the "low level" optimization.

> > Correct me if I'm wrong, but in your VIPM code you still need to compute
> > the actual strips or indexing on the fly each frame?
>
> No. The whole point of VIPM is that you calculate the strips/list/whatever
> offline. You then make a low number (e.g. five) of modifications to that
> data each frame, according to whether an edge collapses or expands. If an
> edge doesn't change, then (a) the strip/list information for it doesn't
> get touched by the CPU, and (b) the edge isn't even considered by the CPU.
>
> Basically, there is a list of collapses/expands (which it is depends which
> way you traverse it). The object has a "current position" pointer, and
> each frame the CPU decides how far, and in which direction, it needs to
> move that pointer. When moving the pointer, the collapse/expand entries
> simply say which indices need to be changed, and to what value, so it is a
> simple memory-copying operation on indices to move the "current position".
> No thinking is required by the CPU, apart from deciding how many tris it
> wants to draw for the VIPM chunk - just memory bandwidth really.
>
> If you're using indexed lists, the tris are also stored in collapse order,
> so collapsed tris just get dropped off the end of the list. If using
> strips, then tris are collapsed out of the middle of strips just by making
> them degenerate (which hardware bins very quickly, since it spots that two
> of the three indices are the same).

Okay--I hadn't known about the tri index order. Very Cool! I hope this
really works on current drivers/hardware though... Since you are shifting
the index range around, a dumb driver would not understand that it doesn't
need to do anything. But this is really slick--it solves the issue of how to
smoothly slide between two "chunk" detail levels at essentially no cost.

The degenerate strips are an issue--in fact I can't picture a strip
surviving after one of its vertices goes away--don't you have to change the
indexing? The only way I see this working is if you don't strip but just
send three indices per tri in your index arrays.

Getting the vertex ordering done "right" sounds like an interesting issue
for VIPM. I'll be curious to see what ideas come of that.

--Mark D.
From: Sam K. <sa...@ip...> - 2000-09-13 23:17:04
Rui,

Aha, you've spotted the flaw in my plan.. :)

Pop Pop Pop. Don't you just love it. I like roam, it's quick and you don't
have to concern yourself with dirty VIPM patch-joins - it's all so seamless.

sam

-----Original Message-----
From: Rui Ferreira <rui...@mo...>
To: gda...@li... <gda...@li...>
Date: 13 September 2000 10:19 AM
Subject: Re: [Algorithms] VIPM With T&L - what about roam

>A static array of vertices with dynamic indexing sounds interesting, and I
>thought of that too, especially caching stuff on the lower-level patches,
>but the problem is: how do you guys handle geomorphing with this scheme ??!
>
>Personally, I dislike roam, except its basic principles like the binary
>triangles concept, diamonds, etc. But what you get with roam, because it
>doesn't reset its mesh every time, is popping when a new child is split.
>You can minimize it with a morph ratio that takes into account the distance
>to the "to be split" node, and such, but those are all fake approximations
>and you will end up with pop artifacts sooner or later.
>
>True geomorphing can only be done if you set your morphing values *per
>node* and based on your split metric. Only then can you have a smooth
>interpolation from one node's base edge height to the child's real height
>from frame to frame.
>
>If you "lock" any subtree and cache away its vertices you just lose
>geomorphing.... again, I can see ways to minimize this... but... you have a
>mesh, you have vertices connecting edges, you either arrange for your
>vertex positions to translate smoothly or you don't !...
>
>I'm a true believer in top-down, recurse, apply metric, "split-only"
>algorithms for this very reason.
>
>-Rui Ferreira
>
>_______________________________________________
>GDAlgorithms-list mailing list
>GDA...@li...
>http://lists.sourceforge.net/mailman/listinfo/gdalgorithms-list
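A tiny sketch of the per-node geomorph Rui argues for, under the usual
bintree-terrain assumptions (all names and the 0-to-1 morph mapping are
illustrative only, not from any particular engine):

struct SplitNode
{
    float baseEdgeMidHeight;  // midpoint of the parents' base edge
    float realHeight;         // true height of the new child vertex
    float error;              // this node's error from the split metric
};

// splitThreshold: the error at which the node splits; morphRange: the
// error band over which the new vertex eases up to its true height.
float MorphedHeight( const SplitNode& n, float splitThreshold, float morphRange )
{
    float t = ( n.error - splitThreshold ) / morphRange;  // 0 at split time
    if ( t < 0.0f ) t = 0.0f;
    if ( t > 1.0f ) t = 1.0f;
    // t comes from the same metric that drove the split, so it changes
    // smoothly frame to frame and the new vertex never pops.
    return n.baseEdgeMidHeight + t * ( n.realHeight - n.baseEdgeMidHeight );
}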
From: Tom F. <to...@mu...> - 2000-09-14 10:48:27
> From: Mark Duchaineau [mailto:duc...@ll...]
>
> > Tom Forsyth wrote:
> > No. The whole point of VIPM is that you calculate the
> > strips/list/whatever offline. [snip]
>
> Okay--I hadn't known about the tri index order. Very Cool! I hope this
> really works on current drivers/hardware though... Since you are shifting
> the index range around, a dumb driver would not understand that it doesn't
> need to do anything.

Not quite sure what you mean. Almost all drivers these days don't even look
at the indices you're sending (annoying Win2k validation aside) - they just
dump them to the card and let the hardware do the de-indexing for you. And
yes, it does indeed work on current hardware - the drivers just point the
hardware DMA streams at the index and vertex lists and say "do it".

> But this is really slick--it solves the issue of how to smoothly slide
> between two "chunk" detail levels at essentially no cost. The degenerate
> strips are an issue--in fact I can't picture a strip surviving after one
> of its vertices goes away--don't you have to change the indexing? The only
> way I see this working is if you don't strip but just send three indices
> per tri in your index arrays.

As well as no longer T&Ling the vertex (you don't actually do anything to
the vertex, it just falls off the end of the transformed list, because you
tell the API not to transform as many), you also change the indices that
refer to that vertex. You change them to refer to the vertex that,
conceptually, the edge collapsed to. So if you have a strip 1,2,3,4,5,6:

2---4---6
|\  |\  |
| \ | \ |
|* \|  \|
1---3---5

And the edge 1-3 collapses, so that vertex 3 collapses to vertex 1, you
modify index 3 to point to the same vertex as index 1, and then you get:

2---4---6
||  /\  |
|| /  \ |
||/    \|
1-------5

The double vertical lines are just to show that the triangle marked (*) has
now become degenerate, since two of its vertices refer to vertex 1. The
hardware will bin that very quickly - it's not a significant overhead to
have it lying around unless the number of degenerates grows large (e.g.
twice as many degen as non-degen or something like that).

Note that vertex 3 will not actually be the third vertex in the list - it
will of course be the last, so it can fall off the end when the collapse
happens.

> Getting the vertex ordering done "right" sounds like an interesting issue
> for VIPM. I'll be curious to see what ideas come of that.

Ah, well that's the trick. But at least it's all done offline. Algorithms
like Quadric Error Metrics (Garland & Heckbert:
http://graphics.cs.uiuc.edu/~garland/papers/quadrics.pdf) (and the
derivative by Hoppe) seem to work extremely well - I've been very pleased
with the results.

> --Mark D.

Tom Forsyth - Muckyfoot bloke.
Whizzing and pasting and pooting through the day.
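Spelled out as code, the collapse in the diagram above is a single index
write (a hypothetical minimal illustration; vertex numbers double as
indices here):

#include <cstdint>

// The strip from the diagram: tris (1,2,3), (3,2,4), (3,4,5), (5,4,6)
// with alternating winding. Array entry 2 is the one referring to vertex 3.
uint16_t strip[6] = { 1, 2, 3, 4, 5, 6 };

void CollapseEdge13()
{
    strip[2] = 1;  // vertex 3 collapses to vertex 1
    // The first tri is now (1,2,1): degenerate, and binned almost for free
    // because two of its three indices are equal.
}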
From: Tom F. <to...@mu...> - 2000-09-14 11:32:27
> From: Mark Duchaineau [mailto:duc...@ll...]
>
> Hmmm...I just skimmed your web page on VIPM and that explains a lot. I am
> concerned about the three indices per tri rather than what you get with
> strips (for infinitely long strips this gets down to 1). We are talking
> dynamic loads to graphics-card memory in general, as you swap chunks in
> and out, taking up room in the limited graphics memory, and depending on
> the chip taking additional bandwidth to access those indices. In short, I
> think in general you need to minimize the number of indices that you send
> to the hardware or store there (same for vert coords and texture coords
> etc). The O(number of new verts) applies mainly to PC memory unless you
> have really fine-grained ways to send a few more verts/indices here and
> there to the graphics card. Unless you are assuming that the indices are
> all getting shipped over the AGP bus every frame? Also, how do you manage
> to keep the array contiguous in graphics card memory if you don't allocate
> enough space there to ship the whole thing? But these are quibbles.

In practice the extra bandwidth of the indices is pretty tiny. One DWORD per
tri? Peanuts, considering your average fairly efficient vertex-caching
scheme will need to load a whole vertex (around 32 DWORDS) per tri.

But versions that use strips have certainly been considered on this list
(the "skipstrips" stuff). They in turn suffer because as the number of tris
decreases, the number of indices doesn't. With list VIPM, tris that get
removed fall off the end of the index list and never get considered. With
skipstrips, they are still there, they just become degenerate. The real
point of skipstrips is not to save memory/bandwidth, but (a) to increase
vertex cache efficiency (which the lists method isn't great at), and (b) to
be friendly to bits of hardware (specifically a certain console) that like
strips, but not much else.

And yes, on all current implementations of PC hardware, both the indices and
the vertices get shipped over the AGP bus every frame. Even the GeForce,
which in theory can use video memory vertex data, in practice doesn't. The
reason is that if you put your vertices in video memory, they just fight for
precious pixel & texel bandwidth, and your nice wide AGP4x bus sits there
idle. There are very few, if any, cases where the D3D driver (which is free
to choose where it puts its vertices) decides to put them in video memory.
This is also true of the Radeon AFAIK, and likely to stay that way for the
foreseeable future.

> The term VIPM is misleading and restrictive--the concept deserves better.
> The idea really has to do with sending things *within* chunks in a
> "static" progressive order, taking advantage of caching effects to allow
> only an array-range update or trickle of geometry each frame, and relying
> on some other VD processing at the macro level. The PM is the restrictive
> part--you can do this with *any* progressive stream that doesn't move
> vertices. The results of doing a split-only ROAM without screen-projecting
> the error produce such a progressive sequence, as does the more general
> queue-based refinement of a DAG style hierarchy. VIPM is misleading in
> that it makes one think "view independent" for the big picture, which is
> not the intention. It is a progressive vertex array, really, and you would
> like to have a few thousand of them active, with a few hundred verts each.
> Ideally. I'd be curious what current graphics hardware can handle.

Indeed - the "VI" bit only refers to a small chunk at a time (I aim for
around a thousand tris or so per batch). I am currently getting around 2-3
million tris/sec on a GeForce1 + Athlon 600, though my vertex cache
efficiency is currently poor, and the GeForce1 is known to have ... um ...
"issues" with indexed lists (I believe the driver has to do some work before
sending them down to the chip - this is unique to the GeForce1 - it's fixed
in the GF2, and indeed every other chip for the past few years). I really
need to try the code on a GF2 or Radeon to see what the real speed is. But I
do know that the CPU load due to the processing I have to do because of the
VIPM is minuscule. The current bottleneck is the driver working around the
hardware features. Using skipstrips would probably be much better for a
GeForce1, since it can handle indexed strips natively, but I haven't found
the time to do that yet.

> --Mark D.

Tom Forsyth - Muckyfoot bloke.
Whizzing and pasting and pooting through the day.
From: Tony C. <to...@mi...> - 2000-09-14 12:54:42
>In practice the extra bandwidth of the indices is pretty tiny. One DWORD
>per tri? Peanuts, considering your average fairly efficient vertex-caching
>scheme will need to load a whole vertex (around 32 DWORDS) per tri.

You mean 32 BYTEs not DWORDs, right Tom? Still much bigger than the index
data though. Your other comments about lists versus strips for VIPM seem
right on the money.

Tony Cox - DirectX Luminary
Windows Gaming Developer Relations Group
http://msdn.microsoft.com/directx
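For concreteness, a vertex layout of the sort being discussed that lands on
those 32 bytes (a hypothetical example, not any particular engine's format):

// 32 bytes per vertex: what the card must fetch per cache miss, versus a
// 2-byte WORD per index. Layout is illustrative only.
struct Vertex
{
    float x, y, z;     // position        - 12 bytes
    float nx, ny, nz;  // normal for T&L  - 12 bytes
    float u, v;        // texture coords  -  8 bytes
};                     // total: 32 bytes = 8 DWORDs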
From: Tom F. <to...@mu...> - 2000-09-14 14:02:39
Oops. Yes.

Tom Forsyth - Muckyfoot bloke.
Whizzing and pasting and pooting through the day.

> From: Tony Cox [mailto:to...@mi...]
>
> >In practice the extra bandwidth of the indices is pretty tiny. One DWORD
> >per tri? Peanuts, considering your average fairly efficient
> >vertex-caching scheme will need to load a whole vertex (around 32 DWORDS)
> >per tri.
>
> You mean 32 BYTEs not DWORDs, right Tom? Still much bigger than the index
> data though. Your other comments about lists versus strips for VIPM seem
> right on the money.
>
> Tony Cox - DirectX Luminary
> Windows Gaming Developer Relations Group
> http://msdn.microsoft.com/directx
From: Mark D. <duc...@ll...> - 2000-09-14 18:07:09
Tom,

So the picture you and Charles are painting is this (see if I get this
right):

1) Send vert attribs (x,y,z,u,v,...) and index arrays across the AGP bus
each frame, and let the textures and frame buffer dominate the on-card
memory bandwidth. If all your lighting is done in textures/normal maps, and
if you use tri bintree meshes per surface "patch", then the info per vert is
(x,y,z): 3 floats = 12 bytes, plus (u,v): 2 shorts = 4 bytes, total = 16
bytes. Since you are sending each vert across the AGP bus, and there is only
a dinky little cache on the other end, you have to be very careful to
arrange the order the verts are indexed to avoid sending them multiple
times. This is of course dependent on what the hardware's replacement
strategy is and what the cache size is.

2) In the simple VIPM scheme, just send three indices per tri, with verts
and tris listed in the arrays in progressive split order. The indices need
to be updated for triangles whose vertices moved in any splits introduced in
the frame, which you precompute as a kind of table lookup of state-change
lists. These changes will generally be scattered through index memory,
leading to bad PC-memory cache behavior. But you hope that only a tiny
fraction of the indices need to be updated per frame (this is a very
ROAM-like hope ;-). The state-change info takes 14 bytes per vert according
to Charles' web page, so you are almost doubling the mem per vert. If your
progressive scheme were tri bintree split-only order, then there is no
additional storage for index changes; you just know what they are based on
which diamonds (which correspond one-to-one with verts) are split (Charles
alluded to this on his VIPM page).

Per-frame index transmission across the AGP bus per tri is 3 shorts = 6
bytes. So if you are very lucky and send each vertex (16 bytes) once, then
on average you have 2 tris per vert and so 16+12=28 bytes per vert including
indices. If you are at the "infinite strip optimum" you get each vertex sent
twice, leading to 2*16+12=44 bytes sent per vert including indices. Let's
imagine you are using a graphics chip capable of 30M tris/sec, and you want
to actually achieve this (ha!): this would mean pumping 840-1320M bytes/sec
over the bus. Okay, the bus can handle this in theory on AGP4x (1GB/sec) on
the wildly optimistic side, but not in any real situation. Also, this is
sucking up a big chunk of your PC memory system bandwidth *continuously*, so
the rest of your app is going to take a performance nose-dive. So... does it
make sense to put some geometry info into graphics-card mem?

Of course the optimistic scenario requires extreme care in the order the
tris are listed and indexed. Since this is not a single static mesh, you
have to come up with index orders that are best for the whole range of
surfaces you get, not just one. If you really want to minimize the number of
times verts are sent, you need to allow much more index manipulation per
frame to optimize, whether via precomputed state-change lists or through
some as-yet-unknown on-the-fly technique.

3) In the "stripped" version of VIPM, cover the chunk with strips (chosen in
a particular way?) and fiddle with the indices just as in case (2). But keep
drawing the same strips. This means you send the index data for the whole
chunk at full res. This limits how big a swing in resolution you can have in
a chunk before this cost dominates. If you are at full res then the index
cost is 1/3 of case (2). If you are at 1/3 res then the cost is the same as
case (2). Since you are trying to force strip order, you will generally do
no better than case (2) for vertex on-card cache coherence, and probably
worse. The card has to expend some effort in theory to eliminate the
degenerate tris, although this could be negligible for a good card/driver. I
don't see this as either a big win or a big loss versus the simple scheme,
so I would tend to go simple.

Of course, you could use the incremental stripping idea from the ROAM paper,
which works on any locally updating mesh, including PM. Since you are
clearly hoping for coherence in the index-update step, this is a cheap way
to make pretty good strips. Plus it avoids the issue of losing vertex-cache
coherence for any but the one mesh you optimized for.

--Mark D.

Tom Forsyth wrote:
> Oops. Yes.
>
> Tom Forsyth - Muckyfoot bloke.
> Whizzing and pasting and pooting through the day.
>
> > From: Tony Cox [mailto:to...@mi...]
> >
> > >In practice the extra bandwidth of the indices is pretty tiny. One
> > >DWORD per tri? Peanuts, considering your average fairly efficient
> > >vertex-caching scheme will need to load a whole vertex (around 32
> > >DWORDS) per tri.
> >
> > You mean 32 BYTEs not DWORDs, right Tom? Still much bigger than the
> > index data though. Your other comments about lists versus strips for
> > VIPM seem right on the money.
> >
> > Tony Cox - DirectX Luminary
From: Angel P. <ju...@bi...> - 2000-09-15 15:10:37
What about having an array with the vertex collapses (the N-th element
storing the new index of the N-th vertex after the collapse), with the verts
sorted in discard order, and several static LODs of the trilist (each LOD
with ~half the verts of the next one), with the triangles stored in strip
order. The algorithm should work like this:

1. Pick the number of used verts from a lookup table using the distance and
screen area.
2. Pick the LOD and the trilist for that LOD.
3. For each triangle from the trilist, collapse each vertex until its index
becomes smaller than the number of used verts; if the triangle does not have
degenerate edges, put its vert indices in the index list.

You will perform less than one split per vertex on average, the triangles
will always be in roughly strip order, the memory requirements will be
slightly lower than for the vsplit approach (no more 14 bytes per vsplit),
it will handle worst-case situations (a huge landscape showing behind the
door) much more gracefully than updating the PM each frame, and it's much
easier to understand and simpler to implement. Here's some code I just wrote
(so it probably has bugs):

int currentNumVerts = GetNumVertsFromLookupTable( ScreenArea, ObjectDistance ); // with interpolation

// Pick the coarsest static LOD that still has enough verts.
int currentLOD = 0;
while( LODs[ currentLOD ].NumVerts < currentNumVerts )
    currentLOD++;

int numIndices = 0;
for( int i = 0; i < TriListLODs[ currentLOD ]->NumTris; i++ )
{
    CIndexedTriangle* currentTri = &TriListLODs[ currentLOD ]->Tris[ i ];

    // Quick reject: NumVertsBeforeCollapse is the vert count at which this
    // triangle first becomes degenerate (precomputed offline).
    if( currentNumVerts <= currentTri->NumVertsBeforeCollapse )
        continue;

    // Chase each index down the collapse chain until it lands on a vertex
    // that is still in use. Verts are sorted in discard order, so vertex i
    // is alive iff i < currentNumVerts (e.g. with 100 used verts, index 143
    // might chase 143 -> 120 -> 87 and stop).
    int index1 = currentTri->Index1;
    while( index1 >= currentNumVerts )
        index1 = collapseArray[ index1 ];
    int index2 = currentTri->Index2;
    while( index2 >= currentNumVerts )
        index2 = collapseArray[ index2 ];
    int index3 = currentTri->Index3;
    while( index3 >= currentNumVerts )
        index3 = collapseArray[ index3 ];

    // Only emit the triangle if no edge has become degenerate.
    if( index1 != index2 && index2 != index3 && index3 != index1 )
    {
        vertexIndexes[ numIndices     ] = index1;
        vertexIndexes[ numIndices + 1 ] = index2;
        vertexIndexes[ numIndices + 2 ] = index3;
        numIndices += 3;
    }
}
From: Charles B. <cb...@cb...> - 2000-09-14 19:42:59
Mark, you have the basic picture right; Tom can clarify the issues with the
AGP bus better than I can.

First, card caches are getting pretty big; the GeForce 2 and Radeon have
vertex caches of about 16 entries, and future cards will have more. This
means you can send 18 triangles with only 12 verts (a 4x4 grid of verts,
with one row of the grid already in cache, makes a 3x3 grid of triangle
pairs). In practice, you can get about one triangle per vert.

As for optimizing the cache, Tom and I have talked about this a bit before,
the basic technique being to swap out to "prefixes" of optimized indices.
This incurs no CPU work; it's just a problem of storing all these prefixes.
However, they can be shared, so if you have a lot of model duplication (as
we do in games) the waste can be reclaimed. The result is that you can keep
all but a small portion of your index chunk cache-optimal at all times.

> 2) In the simple VIPM scheme, just send three indices per tri, and verts
> and tris are listed in the arrays in progressive split order, the indices
> need to be updated for triangles whose vertices moved in any splits
> introduced in the frame, which you precompute as a kind of table lookup of
> state-change lists. These changes will generally be scattered through
> index memory, leading to bad PC-memory cache behavior. But you hope that
> only a tiny fraction of the indices need to be updated per frame (this is
> a very ROAM-like hope ;-).

This is actually not a problem for the CPU, because you never ever read from
the index list. That means all you ever do is writes, and you can just fire
them off and they get retired asynchronously; the CPU never stalls. If you
want to get fancy you can even tell the CPU cache not to mirror that memory
in cache, and just write straight to main memory. The wonderful Athlon chip
can have 32 outstanding queued stores, so this is no problemo. The problem
would only arise if you write and read from the same data structure, which
we would never do. Note that you must, however, wait a while before
rendering after you make your index changes, or your AGP DMA will stall on
the CPU stores finishing. If this were the bottleneck, we'd be golden.

Let me try to check your bandwidth numbers. Let's imagine that I can make
great strips, so I just send an index per triangle, and I'm using the cache
well so I just send one vert per triangle, and I'm aiming at 20 MtriHz.
First of all, my vertices are 32 bytes (3*4 for the position, 2*4 for the
UVs (they're floats), and 3*4 for the normal, cuz I want the card to do my
lighting for me), so I'm sending 34 bytes if you count the index = roughly
600 MB/sec. Ok, so now let's imagine I'm not using the cache perfectly, and
I'm using indexed lists, so I need 6 bytes for indices, and I'm sending 1.5
verts per tri = 54 bytes/tri; so I need just about 1 GB/sec. This is a bit
optimistic to expect from your AGP, but if you cut it in half, you're still
getting 10 MtriHz. The point is that the VIPM process did not hurt your
triangle rate in the least; any triangle rate you can get without VIPM, you
can get with VIPM too. That wasn't true until Tom had the idea for optimized
prefixes of indices, but if you combine it with VIPM stripping, it's true
for any card.

> So...does it make sense to put some geometry info into graphics-card mem?

Probably. Cards just don't/won't do it. Anyway, the X-Box will solve this
neatly with its unified memory architecture. You can do your VIPM in main
memory, and the card will just tear through your data from there at
breakneck speeds.

> If you really want to minimize the number of times verts are sent, you
> need to allow much more index manipulation per frame to optimize, whether
> via precomputed state-change lists or through some yet unknown on-the-fly
> technique.

See above (and search the archives) about switching index prefixes.

> 3) In the "stripped" version of VIPM, cover the chunk with strips (chosen
> in a particular way?) and fiddle with the indices just as in case (2).

First, any card that can do 20 MTris/sec won't blink at getting rid of the
degenerate triangles.

Stripped VIPM only really works with the swap-out technique. You start with
a strip for the model. Assuming perfect stripping for the moment, you've cut
your index cost to 1/3. Note that the update process is identical if I
arrange my "vsplit/ecoll" structure a bit: instead of storing a triangle +
vert to change, I store an offset in the index list to change. Then the
vsplit/ecoll doesn't care whether it's strips or lists that are being
modified. Ideally, with strips you'd change fewer indices with each
vsplit/ecoll, because many indices are shared.

Once you've reduced down to some fraction F of the triangles, you switch to
a new optimal stripping (which has removed the degenerate triangles and
perhaps re-ordered the walk for better caching). In the worst case, you are
storing N indices when you could store N*3*F indices. You're also sending N
triangles when you could send N*F triangles. I think F = 2/3 is a good
number to switch at.

Really, with list swapping, the stripping and listing methods are very
similar. It's just a matter of: A. lists are more flexible and friendly; B.
strips are a bit more compact; C. strips are probably faster on current
cards, lists are probably faster in the future.

> Of course, you could use the incremental stripping idea from the ROAM
> paper, which works on any locally updating mesh including PM. Since you
> are clearly hoping for coherence in the index-update step, this is a cheap
> way to make pretty good strips. Plus it avoids the issue of losing
> vertex-cache coherence for any but the one mesh you optimized for.

Hmm; the ROAM paper doesn't go into great detail about this; maybe you could
expound?

First, let me make it clear what I think of as a "good strip": the whole
mesh should be in one strip. You can't afford to issue separate draw
commands for 3-7 vertex strips. I find a good stripping for caching, then
just tack the strips all together with degenerate vertices (duplicated
indices) to make one big strip. That way I just fire a single render.

The problem I have with all the VDPM strippers is that they can't work in
this way, and/or might incur large non-O(N) costs if they try to. For
example, consider the "skip strips" of El-Sana et al. You could keep one big
strip and just change indices; that would work fine and be a pretty fast way
to do VDPM. The problem is that you could never reduce the triangle count.
In VIPM I can reduce the triangle count by just swapping out my strip at a
specific LOD, because I know exactly what the mesh is like at that LOD in
advance. With VDPM, I can't know that, so I can never swap out to a
pre-computed simpler version.

So, how else can I reduce triangle count? Well, what El-Sana does is run
through the strip every frame and eliminate degenerate triangles. Obviously,
that's unacceptable because it requires too much shuffling in memory when
you cut a triangle. To avoid that, you could cut your strip into chunks.
Then your hope is that VDPM just changes triangle count in a few of these
chunks, which are pretty short so that you don't have to do much shuffling
to cut the degeneracies. Now if you want to take this and achieve optimality
again, you'd have to merge neighboring chunks whenever they get too small,
and split them when they get too big.

The problem is this: if your mesh has N triangles, and you keep a bunch of
little strips of target size M, then your CPU cost to fire the strips at the
graphics card is O(N/M). However, your CPU cost to update a strip is O(M)
(that's optimistically assuming that you only change O(1) strips per frame).
To jointly minimize the two, you must have strips of average length
M = O(sqrt(N)), and the result is that your LOD algorithm is O(sqrt(N))
where N is the number of triangles visible.

--------------------------------------
Charles Bloom     www.cbloom.com
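Charles's back-of-envelope figures above, written out as runnable arithmetic
(every input is an assumption stated in his message):

#include <cstdio>

int main()
{
    const double triRate   = 20e6;  // target: 20 MtriHz
    const double vertBytes = 32;    // 3*4 pos + 2*4 UV + 3*4 normal

    // Best case: perfect strips, perfect caching - one 2-byte index and
    // one new vertex per triangle.
    double best = triRate * ( vertBytes + 2 );        // ~680 MB/s, the
                                                      // "600 MB/sec" ballpark
    // Realistic case: indexed lists (3 WORD indices = 6 bytes) and 1.5
    // verts transformed per triangle.
    double real = triRate * ( 1.5 * vertBytes + 6 );  // 1080 MB/s, ~1 GB/s

    printf( "best %.0f MB/s, realistic %.0f MB/s\n", best / 1e6, real / 1e6 );
    // The realistic case just about saturates AGP4x; halve the bandwidth
    // and you still get ~10 MtriHz - with or without VIPM.
    return 0;
}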
From: Mark D. <duc...@ll...> - 2000-09-14 21:44:44
Charles,

Okay, I think I got the main points right, and we agree in our AGP bandwidth
estimates that we are right at the limit of AGP4x for current top-end T&L
cards. Also it looks like strips just aren't worth the trouble of bothering
with, given the hardware trends. It is very interesting that the index
updates are write-only and you can expect on some hardware (all?) that the
CPU doesn't stall. So I'll take all that as agreed, more or less. The two
remaining issues I'll mention for now are:

a) I was hoping when I first heard about the "just move the array limit"
idea that no work per changed tri was needed, and that is not the case. Even
though the CPU/memory usage is small per tri changed per frame, it is there.
I'm worried about how this maps to technology trends. GPUs are getting
faster at per-vert and per-pixel/frag *operations* at a higher rate than
CPUs are getting faster at *operations* (operations meaning measuring only
the compute time and not the associated memory-access time), CPUs are
getting faster at a higher rate than bus bandwidth is, and bus bandwidth is
going up faster than memory speeds. So if I got the sequence straight, it is
shortsighted to do work per changed vert that involves the CPU touching
memory. I think you will have to move to the model you mentioned where you
pre-optimize completely opaque (to the CPU) static chunks of geometry, and
have the CPU just decide which ones to turn on and off (these chunks become
the new atomic primitives from the CPU's point of view). If you are in a
particularly coherent situation, or have some particularly local
modifications to LOD priority (like getting object placement or
line-of-sight right), then you can judiciously split up these black boxes
and do more fine-grained optimization as time allows during a frame.

b) I don't see why you want to use per-vert lighting with geometry LOD. In
my experience this gives horrible artifacts. Again, the technology trend is
to get enough processing power post-texture-lookup to allow all the lighting
effects to happen there, with no lighting control needed at all at the
verts. Right now you can get a complete dynamic lighting environment (i.e.
equivalent to an infinite number of light sources on an infinite sphere) for
diffuse-only shading, using a color and normal texture with cube environment
maps and multitexture. With a second cube lookup after the first you can get
specular for the infinite number of light sources (right now no hardware
that I know of allows this second lookup on the results of the
first--paletted textures are a total hack as a substitute). You would need
some blending to approximate local lighting as you move from one region of
geometry to the next, but I think that could work fairly well. I just don't
see why the L in T&L is at all useful in the long run. Another nice thing
about putting all your lighting data in the textures is that you don't have
to send it over the saturated AGP bus.

Still, I think the idea is great: getting really fast updates as you slide
from low- to high-res and back in a chunk for progressive streams, with a
handful of cheap index tweaks that are write-only ops. I will add that to my
bag of tricks...

--Mark D.
From: Tom F. <to...@mu...> - 2000-09-15 10:47:11
Let me be clear - I'm not saying that an app should choose to stream vertex/index data across the AGP bus. I am saying (a) the app has no choice, it is determined by the driver - this is true of both OpenGL and D3D, and (b) this is what real, existing hardware does! So it's a bit of a non-discussion. :-) Hardware does stream this data across the AGP bus, and we need to deal with that. Incidentally, U and V values are floating-point numbers, and so are 4 bytes long. Yes, we could compress them to 2-byte fixed-point WORDs, but no hardware does this (sadly). As far as skipstrip vs list (both versions indexed), my feelings about the two are these: (a) At full detail, a list will give better vertex cache behaviour (since it can do fractal-like patterns better that scale well to any cache size). (b) However, the basic list case needs tris to be in discard order, which almost completely removes this advantages, and in fact makes the vertex cache performance much worse. (c) But you could keep the list in vertex-cache-optimal order, and then just make the binned tris degenerate, instead of dropping them off the end of the list. This increases index bandwidth, but also increases vertex cache behaviour, which IMHO is much more important. (d) Skipstrips are a "no brainer" for some platforms (e.g. GF1, PS2, maybe some other consoles). They are obviously faster. (e) Thinking about it, the difference in complexity between the two is minimal. For skipstrips you need to find optimal strip order, but then for lists you also need to find an optimal order (you are not as constrained, but this just makes the combinatorial explosions worse, not better). And for both, doing the changes is the same code, it's still just changing index values. So actually I don't think there is any great difference in complexity. So, skipstrips give better vertex cache behaviour than the basic lists version, but they use the same index bandwidth whatever the actual number of tris displayed. The modified version of lists, that orders them in vertex-cache order, would get better vertex-cache performance, but the same worse index bandwidth behaviour. So when deciding whether to do skipstrips or "skiplists", I guess it's all down to which (a) is better supported by hardware and (b) gives the best vertex-cache performance. Remember that the strip will _not_ have a third the number of indices of the list - it has to restart strips, turn corners, etc, and those all bring the number of indices up. And that doubling the number of indices to get 10% better vertex caching is probably worth it. I'll have to have a look at the ROAM incremental stripping thing - would be interesting. Tom Forsyth - Muckyfoot bloke. Whizzing and pasting and pooting through the day. > -----Original Message----- > From: Mark Duchaineau [mailto:duc...@ll...] > Sent: 14 September 2000 19:07 > To: gda...@li... > Subject: Re: [Algorithms] VIPM With T&L - what about roam > > > Tom, > > So the picture you and Charles are painting is this (see if > I get this right): > > 1) send verts attrribs (x,y,z,u,v,...) and index arrays > across the AGP bus each frame, and let the textures and > frame buffer dominate the on-card memory bandwidth. > If all your lighting is done in textures/normal maps, and > if you use tri bintree meshes per surface "patch", then > the info per vert is (x,y,z): 3 floats=12 bytes, plus > (u,v): 2 shorts=4 bytes, total=16 bytes. 
Since you are > sending each vert across the AGP bus, and there is only > a dinky little cache on the other end, you have to be very > careful to arrange the order the verts are indexed to > avoid sending them multiple times. This is of coarse > dependent on what the hardware's replacement > strategy is and what the cache size is. > > 2) In the simple VIPM scheme, just send three indices > per tri, and verts and tris are listed in the arrays in > progressive split order, the indices need to be updated > for triangles whose vertices moved in any splits > introduced in the frame, which you precompute as a kind > of table lookup of state-change lists. These changes > will generally be scattered through index memory, > leading to bad PC-memory cache behavior. But you > hope that only a tiny fraction of the indices need > to be updated per frame (this is a very ROAM-like > hope ;-). The state-change info takes 14 bytes > per vert according to Charles' web page, so you > are almost doubling the mem per vert. If your > progressive scheme was tri bintree split-only order, > then there is no additional storage for index changes, > you just know what they are based on which diamonds > (which correspond one-to-one with verts) are split > (Charles alluded to this on his VIPM page). > > Per frame index transmission across the AGP bus per > tri is 3 shorts=6 bytes. So if you are very lucky and > send each vertex (16 bytes) once, then on average you > have 2 tris per vert and so 16+12=28 bytes per vert > including indices. If you are at "infinite strip optimum" > you get each vertex sent twice, leading to 2*16+12=44 > bytes sent per vert including indices. Let's imagine > you using a graphics chip capable of 30M tris/sec, > and you want to actually achieve this (ha!): this > would mean pumping 840-1320M bytes/sec > over the bus. Okay, the bus can handle this > in theory on AGP4x (1GB/sec) on the wildly > optimistic side, but not in any real situation. > Also, this is sucking up a big chunk of your PC > memory system bandwith *continuously*, > so the rest of your app is going to take a > performance nose-dive. So...does it make sense > to put some geometry info into graphics-card mem? > > Of course the optimistic scenario requires extreme > care in the order the tris are listed and indexed. > Since this is not a single static mesh, you have to > come up with index orders that are best for the > whole range of surfaces you get, not just one. > If you really want to minimize the number of times > verts are sent, you need to allow much more > index manipulation per frame to optimize, whether > via precomputed state-change lists or through > some yet unknown on-the-fly technique. > > 3) In the "stripped" version of VIPM, cover the chunk > with strips (chosen in a particular way?) and fiddle with > the indices just as in case (2). But keep drawing the > same strips. This means you send the index data > for the whole chunk at full res. This limits how > big a swing in resolution you can have in a chunk > before this cost dominates. If you are at full res > then the index cost is 1/3 of case (2). If you are > at 1/3 res then the cost is the same as case (2). > Since you are trying to force strip order, then you > will generally do no better than case (2) for > vertex on-card cache coherence, and probably > worse. The card has to expend some effort in > theory to eliminate the degenerate tris, although > this could be negligible for a good card/driver. 
> I don't see this as either a big win or big loss > versus the simple scheme, so I would tend to > go simple. > > Of course, you could use the incremental stripping > idea from the ROAM paper, which works on any > locally updating mesh including PM. Since you > are clearly hoping for coherence in the index-update > step, this is a cheap way to make pretty good strips. > Plus it avoids the issue of losing vertex-cache > coherence for any but the one mesh you optimized for. > > --Mark D. |
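A back-of-envelope version of the bandwidth sums above (the byte counts and the 30M tris/sec figure are the ones quoted in the message, not measurements):

    // Rough AGP-bandwidth estimate for streaming VIPM geometry each frame.
    #include <cstdio>

    int main()
    {
        const double trisPerSec  = 30e6;            // hoped-for throughput
        const double vertsPerSec = trisPerSec / 2;  // ~2 tris per vert in a big mesh

        const double bytesPerVertBest  = 16 + 2 * 6;     // vert sent once + 2 tris of indices
        const double bytesPerVertWorst = 2 * 16 + 2 * 6; // vert sent twice ("infinite strip optimum")

        printf("best case : %.0f MB/sec\n", vertsPerSec * bytesPerVertBest  / 1e6); // ~420
        printf("worst case: %.0f MB/sec\n", vertsPerSec * bytesPerVertWorst / 1e6); // ~660
        return 0;
    }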
From: Tom F. <to...@mu...> - 2000-09-15 11:16:27
|
> From: Charles Bloom [mailto:cb...@cb...] > [snip] > >2) In the simple VIPM scheme, just send three indices > >per tri, and verts and tris are listed in the arrays in > >progressive split order, the indices need to be updated > >for triangles whose vertices moved in any splits > >introduced in the frame, which you precompute as a kind > >of table lookup of state-change lists. These changes > >will generally be scattered through index memory, > >leading to bad PC-memory cache behavior. But you > >hope that only a tiny fraction of the indices need > >to be updated per frame (this is a very ROAM-like > >hope ;-). > > This is actually not a problem for the CPU, because you > never ever read from the index list. That means all you > ever do is writes, and you can just fire them off, and > they get retired asynchronously, and the CPU never stalls. > If you want to get fancy you can even tell the CPU cache > not to mirror that memory in cache, just write straight > to main memory. In DX8, there are "index buffers" that can be (and usually will be) placed in AGP memory, so this behaviour will apply automagically - AGP memory is uncached, and is written through the write-combining buffers. I am sure there will be similar optimisations under OpenGL if possible. Yes, the data structure being read (the collapse/expand data) is strictly linear, and so well-cached. At the moment, the destination is in system memory and fairly random, and so is poorly cached (sadly on x86 architectures, the memory is read into the cache, even if you only ever write to it), but this is a feature of the API rather than the hardware, and once they move to AGP memory (DX8 is released in about a month or two), this will be sorted. > The wonderful Athlon chip can have 32 > outstanding queued stores, so this is no problemo. The > problem would only arise if you write and read from the same > data structure, which we would never do. Note that you must, > however, wait a while before rendering after you make your > index changes, or your AGP DMA will stall on the CPU stores > finishing. If this were the bottleneck, we'd be golden. Since drivers typically do around a frame's worth of data buffering anyway, this is almost never a problem. The writes will be finished well before the chip needs them. [snip] > For example, consider the "skip strips" of El-Sana et al. Whoops - just realised that in my previous mails, I've been using "skipstrips" in an ambiguous way. What I mean is "VIPM skipstrips", i.e. VIPM strips that just keep the existing strip, but make some tris in the middle of it degenerate when they collapse an edge. I don't mean that the V_D_PM part of El-Sana et al. is used, just the trick of using strips. [snip] > Charles Bloom www.cbloom.com Tom Forsyth - Muckyfoot bloke. Whizzing and pasting and pooting through the day. |
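As a concrete sketch of the write-only index updates both messages describe - the CollapseRecord layout here is invented for illustration, not Charles' actual VIPM format:

    #include <cstddef>
    #include <vector>

    struct CollapseRecord {
        unsigned short newVert;            // vert that replaces the collapsed one
        std::vector<unsigned> fixups;      // index-buffer positions to redirect
        std::vector<unsigned> degenTris;   // first index of each tri to bin
    };

    // Apply one edge collapse to an index buffer living in uncached AGP
    // memory. Every operation is a plain store - the buffer is never read -
    // so the CPU's store queue retires them asynchronously without stalling.
    void ApplyCollapse(unsigned short* indices, const CollapseRecord& r)
    {
        for (size_t i = 0; i < r.fixups.size(); ++i)
            indices[r.fixups[i]] = r.newVert;

        for (size_t i = 0; i < r.degenTris.size(); ++i) {
            unsigned t = r.degenTris[i];
            indices[t]     = r.newVert;    // all three indices equal: the tri
            indices[t + 1] = r.newVert;    // is degenerate and the card
            indices[t + 2] = r.newVert;    // rejects it cheaply
        }
    }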
From: Tom F. <to...@mu...> - 2000-09-15 12:21:37
|
> From: Mark Duchaineau [mailto:duc...@ll...] [snip] > b) I don't see why you want to use per-vert lighting with > geometry LOD. In my experience this gives horrible > artifacts. Again the technology trend is to get enough > processing power post texture lookup to allow all the lighting > effects to happen there with no lighting control needed at > all at the verts. Right now you can get a complete > dynamic lighting environment (i.e. equivalent to an > infinite number of light sources on an infinite sphere) > for diffuse-only shading using a color and normal > texture with cube environment maps and multitextures. > With a second cube lookup after the first you can get > specular for the infinite number of light sources (right now > no hardware that I know of allows this second lookup > on the results of the first--paletted textures are a > total hack as a substitute). You would need some > blending to approximate local lighting as you > move from one region of geometry to the next, > but I think that could work fairly well. I just don't see > why the L in T&L is at all useful in the long run. Another > nice thing about putting all your lighting data in the > textures is that you don't have to send it over the > saturated AGP bus. Ideally, you want a texture-based lighting approach, but with the hard work done by the graphics card, not the CPU. I use bumpmapping (of any of the three types) for exactly the reasons you state - it pops far less - but ideally I'd want to pass the normal vector (or some analogous info) into the card and have it compute the necessary UV offsets, etc. So there is still some sort of shading info that needs sending to the card, which is all we care about for bandwidth calculations. > Still, I think the idea is great: getting really fast updates > as you slide from low- to high-res and back in a chunk > for progressive streams, with a handful of cheap index > tweaks that are write-only ops. I will add that to my > bag of tricks... A convert! :-) > --Mark D. Tom Forsyth - Muckyfoot bloke. Whizzing and pasting and pooting through the day. |
From: Mark D. <duc...@ll...> - 2000-09-15 16:29:33
|
Tom Forsyth wrote: > > From: Mark Duchaineau [mailto:duc...@ll...] > > [snip] > > > b) I don't see why you want to use per-vert lighting with > > geometry LOD. In my experience this gives horrible > > artifacts. [snip]. > > With a second cube lookup after the first you can get > > specular for the infinite number of light sources (right now > > no hardware that I know of allows this second lookup > > on the results of the first--paletted textures are a > > total hack as a substitute). You would need some > > blending to approximate local lighting as you > > move from one region of geometry to the next, > > but I think that could work fairly well. I just don't see > > why the L in T&L is at all useful in the long run. Another > > nice thing about putting all your lighting data in the > > textures is that you don't have to send it over the > > saturated AGP bus. > > Ideally, you want a texture-based lighting approach, but with the hard work > done by the graphics card, not the CPU. I use bumpmapping (of any of the > three types) for exactly the reasons you state - it pops far less - but > ideally I'd want to pass the normal vector (or some analogous info) into the > card and have it compute the necessary UV offsets, etc. So there is still > some sort of shading info that needs sending to the card, which is all we > care about for bandwidth calculations. I was referring to a bump-mapping method I have used in a fast software-only renderer I wrote several years ago that will in the near future map well to graphics hardware, but does not today. In this method, literally all the lighting information is static and in textures; only purely geometric information is stored at the vertices. The extra bit of hardware that doesn't exist is the second full-blown texture lookup that takes as input the results of the first texture lookup(s). No CPU work and no CPU memory used at all. Check out http://muldoon.cipic.ucdavis.edu/~duchaine/mesh.html for some images generated this way (all with a rather boring lighting environment with only 35 light sources, and smooth objects but with fully-resolved actual normals per pixel because these were all made with variants on Catmull-Clark subdivision). The sample code is part of my LibGen open source distribution at http://www.cognigraph.com/LibGen (see LibGen/Surftools/Mesh/mesh.l). Sorry, Linux/UNIX-like systems only...and no ROAM source released yet :(. As for the vertices, ROAM-style meshes in principle require only a bit or so per triangle to store the equivalent information that index lists store. The decode logic could be done pipelined for "free" on graphics hardware with a tiny bit of gate logic and a small 1D array of shorts. Without this lovely fantasy hardware, you still only need 8 bits each in U and V, although the OpenGL API I use only has a "nice" API call for U,V as shorts and xyz as floats for this, so I stick with that. This all means I get my verts down to 16 bytes for triangle bintree progressive arrays (see the sketch after this message). That's a lot of CPU memory and AGP bandwidth saved, at least on drivers that support that packing format (maybe none do if your indications are correct--that will take some testing). Like I said earlier, it is now known how to convert most any geometry to 2^n+1 by 2^n+1 patches that connect on their mutual edges. This is great for the new wavelet compression methods that our group and the Cal Tech/Bell Labs crowd are working on. This is also great for triangle bintree hierarchies and for organizing your textures. 
All the connectivity information is implied and the split/merge rules are just fast logic, no tables required, unlike PMs. So I think VI-ROAM within chunks is very promising, and VD-ROAM at the macro level complements this well. So as you can see I'm only a partial convert... ;-) --Mark D. |
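The 16-byte packing Mark describes might look like this sketch (the struct name is invented; the GL 1.1 calls are the standard vertex-array ones he alludes to, and whether a given driver handles the packing natively is exactly the open question above):

    #include <GL/gl.h>

    // 16-byte bintree vertex: xyz as floats, UV as shorts. The shorts hold
    // 8-bit values; short is just the smallest type the GL path accepts.
    struct BintreeVertex {
        float x, y, z;   // 12 bytes of position
        short u, v;      //  4 bytes of texture coords
    };
    // sizeof(BintreeVertex) == 16 with the usual 4-byte alignment.

    // Interleaved vertex-array setup; 'verts' points at the chunk's array.
    void SetArrays(const BintreeVertex* verts)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glVertexPointer(3, GL_FLOAT, sizeof(BintreeVertex), &verts->x);
        glTexCoordPointer(2, GL_SHORT, sizeof(BintreeVertex), &verts->u);
    }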
From: Charles B. <cb...@cb...> - 2000-09-20 22:03:27
|
Well, I'm now using "thick" BSP planes as suggested to me by Angel Popov. This basically means treating all your BSP planes as having some thickness; then any vert which lies in that thickness is treated as being already on the plane. When you clip a poly against a plane, the verts which are "on" the plane are not modified. This prevents you from clipping segments where one point has a distance of 999 and the other has a distance of -0.001. The problem is that the clipper is still not consistent. That is, if I take a polygon and clip it against a plane to produce two new polygons, and then test those against the plane, they should tell me they're on the front and back sides; instead they can occasionally report that they intersect the plane, because a vertex produced by clipping was not inside the thickness of the plane. The only solution I can think of for this is to have a larger epsilon for plane-testing than for plane-clipping (double seems to work), but that concerns me a bit (is there a way that could lead to inconsistency?). -------------------------------------- Charles Bloom www.cbloom.com |
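A minimal sketch of the thick-plane classification being described, assuming a plane stored as Ax+By+Cz+D (the epsilon value is a placeholder):

    // Classify a point against a "thick" plane: anything within EPSILON of
    // the plane counts as ON, and ON verts are never modified by the clipper.
    enum Side { SIDE_ON, SIDE_FRONT, SIDE_BACK };

    const float EPSILON = 0.01f; // plane thickness - must swallow all FP error

    Side ClassifyPoint(const float p[3], const float plane[4])
    {
        float d = plane[0]*p[0] + plane[1]*p[1] + plane[2]*p[2] + plane[3];
        if (d >  EPSILON) return SIDE_FRONT;
        if (d < -EPSILON) return SIDE_BACK;
        return SIDE_ON;  // inside the thickness: treat as exactly on the plane
    }

A polygon then straddles the plane only when it has verts on both the FRONT and BACK sides; ON verts are copied to both output polygons unchanged, which is what stops the 999 / -0.001 splits.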
From: Charles B. <cb...@cb...> - 2000-09-21 18:17:07
|
Yep, as Angel has wisely directed me, it was actually just precision problems I was having. Let me write up my understanding of the situation to make sure I'm grasping what's going on : Precision is a big problem in games and 3d in general, because we have a really pathological case : we have small, detailed models in 3d, whose vertices are very close together. On the other hand, those models sit in very large worlds where you can have models that are very far apart. This gives you a big range in coordinates. For example, imagine you want the distance between two points in a model. If I take those points in model space, v1m and v2m, then I get a distance dm = | v1m - v2m |. But what if I instead transform them to world-space first, and then take the distance there? If my transform is just a translation by w, I get : dw = | (v1m+w) - (v2m+w) |. Now if dm is very small, and w is very large (which are both typical in games), then my distance dw is horribly inaccurate. Floats only have 24 bits of mantissa precision, which you can use up fast. Imagine a human on a landscape. Important verts on the human model may only be 1 cm apart. The landscape could be 10 km square. The ratio is 1 million, or 20 bits, and suddenly I'm doing math in 4 bits of precision! You can avoid these precision nightmares most of the time by never going to world-space. That is, light in model space, render using a direct model-to-view transform, etc. Also, collision testing between two models should take one model into the space of the other, instead of taking them both into world space. Unfortunately, this problem cannot be avoided for whole-world spatial structures like a BSP. You must take all your models into world-space when you put their polys in the BSP, since the BSP must all be in one common space. The only way to save yourself from nightmares is to make sure your epsilon is big enough to consistently hide all precision problems. The epsilon, or plane thickness, is not really fixing your precision problems, it's just catching them in a consistent way (e.g. treating segments that are too small to handle as just points, and such). If you have 24 bits of float precision, then you must take your whole-world size and divide by around 256k (18 bits, leaving 6 for accuracy), and use that as your epsilon. This puts a pretty hard clamp on the size and detail of the world that can be modeled in floats - a roughly 1 km square area is all you get, if you want accuracy on the order of human fingers. At 11:08 AM 9/21/2000 +0200, Angel Popov wrote: >If I understood you correctly - the newly created split vertex >does not lie within +/-EPSILON distance from the plane >that generated it. There are three reasonable ways to handle >this: >1. Increase EPSILON. It MUST be big enough to handle such >precision errors. >2. Increase your FP precision by using doubles. >3. Split polygons with very long edges to smaller polygons. >I am quite certain that you get such inconsistencies only >when splitting long edges. -------------------------------------- Charles Bloom www.cbloom.com |
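A tiny sketch of the cancellation Charles describes - the same two model-space points measured before and after a large world translation (the numbers are illustrative):

    #include <cstdio>
    #include <cmath>

    int main()
    {
        // Two verts 1 cm apart on a model, in metres.
        float v1m = 0.000f, v2m = 0.010f;
        float w   = 10000.0f;              // model sits 10 km from the origin

        float dm = fabsf(v2m - v1m);              // model space: exact
        float dw = fabsf((v2m + w) - (v1m + w));  // world space: quantised

        printf("model space: %f\n", dm);  // 0.010000
        printf("world space: %f\n", dw);  // ~0.009766 - the ulp at 10000 is
                                          // about 1 mm, so only ~10 steps of
                                          // the 1 cm difference survive
        return 0;
    }

For the 10 km world above, the world-size/256k rule gives an epsilon of roughly 4 cm.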
From: Angel P. <ju...@bi...> - 2000-09-22 09:44:22
|
> Unfortunately, this problem cannot be avoided for whole-world > spatial structures like a BSP. You must take all your models into > world-space when you put their polys in the BSP, since the BSP must > all be in one common space. You can use an octree or kd-tree and put BSP trees in the leaves (each leaf will have its own space). Even without the FP precision problems this is very useful because it will give you much better tree balance (faster collision detection and raytracing) and will also speed up the compilation process quite a lot, especially with complex geometry. I think that the Lithtech engine does something similar. Currently when building the root BSP nodes I first pick some axis-aligned split planes that have an equal number of polygons on both sides, but I don't redefine the space. |
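A sketch of that layout - a spatial tree whose leaves each carry their own local space and BSP (the types are illustrative, not from any engine mentioned here):

    struct Vec3 { float x, y, z; };

    struct BspNode;           // the usual plane/front/back BSP, unchanged

    struct OctreeLeaf {
        Vec3     origin;      // centre of this leaf's cell, in world space
        BspNode* bsp;         // built from polys rebased to 'origin'
    };

    // Rebase a world-space vert into leaf-local space before BSP insertion.
    // Doing the subtraction in doubles keeps the rebasing step exact enough.
    Vec3 ToLeafSpace(const OctreeLeaf& leaf, double wx, double wy, double wz)
    {
        Vec3 v;
        v.x = (float)(wx - leaf.origin.x);
        v.y = (float)(wy - leaf.origin.y);
        v.z = (float)(wz - leaf.origin.z);
        return v;
    }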
From: Angel P. <ju...@bi...> - 2000-09-21 07:27:22
|
If I understood you correctly - the newly created split vertex does not lie within +/-EPSILON distance from the plane that generated it. There are three reasonable ways to handle this: 1. Increase EPSILON. It MUST be big enough to handle such precision errors. 2. Increase your FP precision by using doubles. 3. Split polygons with very long edges into smaller polygons. I am quite certain that you get such inconsistencies only when splitting long edges. > Well, I'm now using "thick" BSP planes as suggested > to me by Angel Popov. This basically means treating > all your BSP planes as having some thickness; then > any vert which lies in that thickness is treated as > being already on the plane. When you clip a poly > against a plane, the verts which are "on" the plane > are not modified. This prevents you from clipping > segments where one point has a distance of 999 > and the other has a distance of -0.001 > > The problem is that the clipper is still not consistent. > That is, if I take a polygon and clip it against a > plane to produce two new polygons, and then test those > against the plane, they should tell me they're on the front > and back sides; instead they can occasionally report that > they intersect the plane, because a vertex produced by > clipping was not inside the thickness of the plane. > > The only solution I can think of for this is to have a > larger epsilon for plane-testing than for plane-clipping > (double seems to work), but that concerns me a bit (is > there a way that could lead to inconsistency?). > > > > -------------------------------------- > Charles Bloom www.cbloom.com |
From: Jamie F. <j.f...@re...> - 2000-09-22 08:48:41
|
[Snip] > Now if dm is very small, and w is very large (which are both > typical in games) then my distance dw is horribly inaccurate. We've had this problem with physics.... [Snip] > You can avoid these precision nightmares most of the time by > never going to world-space. That is, light in model space, render > using a direct model-to-view transform, etc. Also, collision > testing between two models should take one model into the space of > the other, instead of taking them both into world space. > > Unfortunately, this problem cannot be avoided for whole-world > spatial structures like a BSP. You must take all your models into > world-space when you put their polys in the BSP, since the BSP must > all be in one common space. Is this really true? I would imagine it's the straightforward way to use the BSP. But couldn't you redefine your space at each node in the tree? So you'd have (as one path down the tree) planes P0, P1, P2, ..., Pn, defined as ( x0, y0, z0, d0 ), etc. At node i, use ( xi, yi, zi ) * di as your origin. You now have as much precision as you can get when deciding which side of the plane something is on. You'd have lost precision along the way (just as you would have had you performed the calculation in world space), but you'll get more consistent decisions.... You could also define each plane in the BSP relative to its parent, which I think may increase the precision of the planes (although it won't help with the accuracy of objects which really are in world space). I've not spent too much thought on this, and it's still early in the morning... so, am I just spouting? Jamie |
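A sketch of the per-node re-origin Jamie suggests, taking the plane's foot point n*d as the local origin (the helper and conventions are a guess at what he means, not code from the thread):

    struct Plane { float nx, ny, nz, d; }; // unit normal n, with n.p = d on the plane

    // Evaluate the signed distance relative to the plane's own foot point
    // n*d instead of via the raw n.p - d form. The large n.p-versus-d
    // cancellation is avoided; the subtractions happen componentwise
    // against a point on the plane, which is what gives the more
    // consistent side decisions Jamie is after.
    float SignedDistance(const Plane& pl, float px, float py, float pz)
    {
        float ox = pl.nx * pl.d;   // foot of the plane: the point n*d
        float oy = pl.ny * pl.d;
        float oz = pl.nz * pl.d;
        return pl.nx * (px - ox) + pl.ny * (py - oy) + pl.nz * (pz - oz);
    }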
From: Charles B. <cb...@cb...> - 2000-09-22 17:32:21
|
Well, of course there is hope for your BSP, as has been pointed out by many. I'm probably going to do my BSP in grid-aligned chunks and take each object into the space of its chunk before adding it. This not only helps precision, it helps the localization of your BSP, reduces splitting, improves memory coherency, etc. Thank god for the algorithms list :^) -------------------------------------- Charles Bloom www.cbloom.com |
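A sketch of that grid-chunk rebasing (the chunk size, the names, and the 2D simplification are all illustrative):

    #include <cmath>

    const double CHUNK_SIZE = 1024.0; // metres per grid chunk - illustrative

    // Map an object's world position to its chunk coords plus a local
    // offset. Polys are then added to that chunk's BSP using the local
    // offset, so every coordinate inside one BSP stays small and precise.
    void WorldToChunk(double wx, double wz,
                      int& cx, int& cz, float& lx, float& lz)
    {
        cx = (int)floor(wx / CHUNK_SIZE);
        cz = (int)floor(wz / CHUNK_SIZE);
        lx = (float)(wx - cx * CHUNK_SIZE);
        lz = (float)(wz - cz * CHUNK_SIZE);
    }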