RE: [Algorithms] portal engines in outdoor environments
From: Tom F. <to...@mu...> - 2000-08-20 10:59:05
> From: jason watkins [mailto:jas...@po...]
>
> > But _something_ still has to do the readback. The start of the chips'
> > pipeline still needs to know the results at the framebuffer end before
> > it can decide whether or not to start drawing the triangles. It
> > doesn't matter whether the CPU does it or some bit in the T&L section
> > of the chip - it's the same issue - something has to wait for pixels
> > to go most of the way down the rendering pipeline. And that's a
> > performance loss.
>
> Right, of course the rejection logic has to be able to read the
> relevant z's. But it doesn't have to read the most immediate state of
> the z - the state a few polygons ago will do fine for a conservative
> rejection. So I don't see where a stall happens, or where there's a
> performance loss.

That's not the performance hit. You are submitting tris like this:

BeginChunk
Draw tester tris
EndChunk
if ( previous chunk rendered )
{
    Draw real tris
}

The problem is that the if() is being done at the very start of the
pipeline (i.e. the AGP bus - any later and you lose most of the gain),
but it needs to know the results of the rasterisation & Z-test of all
the pixels in the tester tris. That rasterisation and Z-test is
comparatively late in the pipeline on a T&L device (it's early in the
rasterisation, but there is a looooong pipe between the AGP bus and
pixel rasterisation). So your pipeline is going to be completely empty
between those two points. That is a huge pipeline bubble, and if you're
doing it more than once or twice a frame, you are going to lose large
amounts of performance.

You may be able to improve things by doing:

for i = 0 to nObjects
{
    BeginChunk(i)
    Draw tester tris
    EndChunk
}
for i = 0 to nObjects
{
    if ( chunk(i) rendered )
    {
        Draw object i
    }
}

But that is a lot of extra hardware to store all that chunk
information, retrieve it and so on. Lots of complexity.

There are three very nice things about the frame-to-frame coherency
scheme:

(1) No extra fillrate hit. If the object is invisible, it's the same
fillrate as your scheme. If the object is visible, then you still only
draw it once, not twice as with your scheme.

(2) No extra triangles needed. OK, the bounding box is a pretty small
number of tris, but what if you wanted to do this scheme with lots of
smallish objects? That might get significant.

(3) (and this is the biggie) It is already supported by tons of
existing, normal, shipped hardware that is out there. Not some mystical
future device. Real, existing ones that you have probably used.

> Maybe the gain is reduced, when you think about how the rejection
> means there are skips in the flow from AGP RAM to the card's local
> storage/instruction bus, but as I understand it, that's all controlled
> by DMAs from the card anyhow, so not a big deal.

Huge deal if the delay is longer than a few tens of clock cycles. The
AGP FIFOs are not very big, and bubbles of the sort of size you are
talking about are not going to be absorbed by them. So for part of your
frame, the AGP bus will be sitting idle. And if, as is happening, you
are limited by AGP speed, that is going to hurt quite a lot.

> It doesn't rely on frame2frame coherence (which I feel is often a bad
> thing). Perhaps it would be best with a hierarchical z system.

Doesn't help - you still need to rasterise your pixels, which is a long
way down the pipe.

What's wrong with frame-to-frame coherence? Remember, if there is a
camera change or cut, the application can simply discard all the
visibility info it has and just draw everything, until it has vis
information for the new camera position.
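In rough code, the coherency scheme looks something like this. This is
a sketch only, written against the generic OpenGL occlusion-query
interface (a later, portable form of the per-object occlusion tests
some cards expose) rather than any particular card's mechanism, and
DrawObject() / DrawBoundingBox() are hypothetical stand-ins for
whatever the app actually does:

#include <GL/gl.h>

#define MAX_OBJECTS 1024

extern void DrawObject( int i );       /* hypothetical app functions */
extern void DrawBoundingBox( int i );

static GLuint query[MAX_OBJECTS];
static GLuint visible[MAX_OBJECTS];    /* last frame's results */
static int haveResults = 0;            /* no queries issued yet */

void InitQueries( int nObjects )
{
    glGenQueries( nObjects, query );
    for ( int i = 0; i < nObjects; i++ )
        visible[i] = 1;                /* no vis info yet - draw everything */
}

void RenderFrame( int nObjects )
{
    /* Read LAST frame's results first. Those queries have had a whole
       frame to drain through the pipe, so this read does not create the
       bubble an in-frame readback would. On a camera cut, skip the read
       and set every visible[i] back to 1 instead. */
    if ( haveResults )
        for ( int i = 0; i < nObjects; i++ )
            glGetQueryObjectuiv( query[i], GL_QUERY_RESULT, &visible[i] );

    for ( int i = 0; i < nObjects; i++ )
    {
        glBeginQuery( GL_SAMPLES_PASSED, query[i] );
        if ( visible[i] )
        {
            DrawObject( i );           /* the real tris ARE the tester tris */
        }
        else
        {
            /* Hidden last frame: Z-test its bounding box only, with
               colour and depth writes off, so it costs almost nothing. */
            glColorMask( GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE );
            glDepthMask( GL_FALSE );
            DrawBoundingBox( i );
            glDepthMask( GL_TRUE );
            glColorMask( GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE );
        }
        glEndQuery( GL_SAMPLES_PASSED );
    }
    haveResults = 1;
}

The point is that the readback happens a whole frame after submission,
so nothing at the front of the pipe ever sits waiting for pixels at the
back of it.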
> > Added to which, one of the growing bottlenecks is the AGP bus, and
> > doing it this way doesn't help that at all.
>
> Not directly, well... not unless your primitives are so large that the
> rejection happens before the primitive has entirely streamed into the
> card's local FIFO or whatever. However, done right, it could
> potentially reduce state changes or texture downloads, alleviating
> some bus bandwidth.

No no no. There is no way you could get the _hardware_ to reject
state-change info and texture downloads because of some internally
fed-back state. Drivers rely very heavily on persistent state, i.e. not
having to send state that doesn't change. If the driver now doesn't
know whether the state-change info it sent actually made it to the
chip's registers or not, that's just madness - the driver will go potty
trying to figure out what it does and doesn't need to send over the
bus.

Ditto for downloading textures. Since the driver can't know whether the
hardware is going to reject the tris or not, it always has to do the
texture downloads anyway. And if the hardware is doing the downloads
instead (e.g. AGP or cached-AGP textures), then either solution works -
the fast-Z-rejection of pixels means those textures never get fetched.

Far better is the delayed system, where the app can choose not only to
use lower-tri models, but also to force lower mipmap levels, to reduce
texture downloads. Or not, if that looks bad. This sort of decision
MUST be left up to the app of course, because there will always be
cases where a shortcut looks bad, and only the app can know whether
they are acceptable in certain cases or not.

[snip]

> Especially considering that we're quickly reaching the limits of what
> fill rate can be supported by available memory technologies (at the
> right price, that is). And though embedded DRAM seems to answer that,
> it and other possible technologies haven't exactly materialized.

Right. But I've already pointed out that the delayed version uses
_less_ fillrate than what you are proposing, not more, because the
"tester" tris are the same as the tris actually drawn.

Tom Forsyth - Muckyfoot bloke.
Whizzing and pasting and pooting through the day.