On Wednesday, 09.02.2005, at 20:58 +0100, Roland Scheidegger wrote:
> Some more numbers, this time from a 9000Pro (64MB). In contrast to the
> quite slow 7200sdr with only 2.6GB/s ram, this one has 8.8GB/s bandwidth
> (128bit/275MHz DDR). Not to mention the chip is certainly faster too.
> Test system is also faster though, A64 3000+ socket 754, 3.2GB/s system
> memory bandwidth.
> Desktop resolution 1280x1024, quake3 windowed 1024x768.
> The hacks required to disable the heaps are exactly the same as those
> used on r100 (except of course the nr_heaps assertion had to go..., yes
> this hack DOES break the client stuff).
> Local texture size is 35MB, unless otherwise noted (I just changed the
> allocation scheme in the ddx driver, so only 1 framebuffer worth of
> pixmap cache is used instead of 3, btw without really any noticeable
> impact on 2d performance). GART size is 32MB unless specifically stated.
> AGP 4x, local: 125 fps
> AGP 4x, both: 80-123 fps
> AGP 4x, gart only: 68 fps
> AGP 1x, local: 115 fps
> AGP 1x, gart only: 21 fps
> Some rtcw (demo checkpoint) results too (fullscreen 1024x768).
> AGP 4x, local: 70 fps
> AGP 4x, local 45MB: 85 fps
> AGP 4x, both: 62-77 fps
> AGP 4x, both, gart 64MB: 58-68 fps
> AGP 4x, gart only, 64MB: 47 fps
> AGP 1x, local: 56 fps
> AGP 1x, gart only: 14 fps
Thanks for these numbers. They show that the current memory management
strategies are far from perfect. Read on below for some ideas how to
> texdown AGP 4x gart: 230MB/s
> texdown AGP 4x local: 650MB/s!
> texdown AGP 1x gart: 117MB/s before q3, 89MB/s after q3 (?)
> texdown AGP 1x local: 265MB/s!!!
> There seemed to be a problem with gart texturing and AGP modes lower
> than 4x, agpgart reported "Putting AGP V2 device at 0000:00:00.0 into 0x
> mode", glxinfo still reported AGP 1x and 2x, respectively. 1x and 2x
> results were identical, and to put it simply, the results were downright
> appalling. This may be a problem with agpgart (using the version from
> kernel 2.6.10). However, I was amazed at the texdown performance to
> local graphics memory, as it's VERY close to the theoretical limit.
> texdown performance with AGP 4x was also quite good.
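To put "close to the theoretical limit" into numbers, here is a small back-of-the-envelope check. It assumes the usual AGP figures (32-bit bus, roughly 66.67 MHz base clock, one transfer per clock in 1x mode and four in 4x mode) and uses the local-memory texdown rates quoted above. The function name is made up for this illustration.

```python
# Rough check of the texdown numbers against the peak AGP transfer rate.
# Assumed bus parameters: 32-bit (4 byte) wide, ~66.67 MHz base clock.
AGP_BASE_CLOCK_HZ = 66_666_667
AGP_BUS_BYTES = 4


def agp_peak_mb_per_s(multiplier):
    """Peak AGP transfer rate in MB/s (1 MB = 10**6 bytes) for 1x, 2x, 4x."""
    return AGP_BASE_CLOCK_HZ * AGP_BUS_BYTES * multiplier / 1e6


# texdown rates to local memory as quoted above, in MB/s, keyed by AGP mode.
measured_local = {1: 265, 4: 650}

for mode, rate in measured_local.items():
    peak = agp_peak_mb_per_s(mode)
    print(f"AGP {mode}x: {rate} MB/s of {peak:.0f} MB/s peak ({rate / peak:.0%})")
```

Under these assumptions the 1x result is right at the bus limit, while the 4x result still leaves some headroom.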
Keith committed a fastpath for Mesa's texstore functions that reduced
the CPU-overhead of the rgba 32bit texture uploads significantly.
> The rtcw checkpoint demo exceeds "in-use" texture size of 35MB, that's
> why I've put in some results with larger local texture size (as well as
> increased the gart size). 45MB is enough though, with 35MB you'd get
> some occasional drops to around 12fps (and 6fps with agp 1x), these are
> completely gone with 45MB.
> Performance with gart texturing, even in 4x mode, takes a big hit
> (almost 50%).
> I was not really able to get consistent performance results when both
> texture heaps were active, I guess it's luck of the day which textures
> got put in the gart heap and which ones in the local heap. But that
> performance indeed got faster with a smaller gart heap is not a good
> sign. And even if the maximum obtained in rtcw with 35MB local heap and
> 29MB gart heap was higher than the score obtained with 35MB local heap
> alone, there were clearly areas which ran faster with only the local heap.
> It seems to me that the allocator really should try harder to use the
> local heap to be useful on r200 cards, moreover it is likely that you'd
> get quite a bit better performance when you DO have to put textures into
> the gart heap when you revisit that later when more space becomes
> available on the local heap and upload the still-used textures from the
> gart heap to the local heap (in fact, should be even faster than those
> 650MB/s, since no in-kernel-copy would be needed, it should be possible
> to blit it directly).
The big problem with the current texture allocator is that it can't tell
which areas are really unused. Texture space is only allocated and never
freed. Once the memory is "full" it starts kicking textures to upload
new ones. This is the only way of "freeing" memory. Using an LRU
strategy it has a good chance of kicking unused textures first, but
there's no guarantee. It can't tell if a kicked texture will be needed
the next instant. So trying to move textures from GART to local memory
would basically mean that you blindly kick the least recently used
texture(s) from local memory. If those textures are needed again soon
then performance is going to suffer badly.
Therefore I'm proposing a modified allocator that fails when it needs to
start kicking too recently used textures (e.g. textures used in the
current or previous frame). Failure would not be fatal in this case, you
just keep the texture in GART memory and try again later. Actually you
could use the same allocator for normal texture uploads. Just specify
the current texture heap age as the limit.
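To make the proposal a bit more concrete, here is a minimal sketch in Python (not driver code). It assumes each texture carries the heap age at which it was last used, and it ignores where textures sit in the heap, so fragmentation is not modelled. The names Texture, Heap, AllocFailed and age_limit are invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Texture:
    name: str
    size: int       # size in arbitrary units (e.g. MB)
    last_used: int  # heap age at which the texture was last used


class AllocFailed(Exception):
    """Raised when making room would require kicking too-recent textures."""


class Heap:
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = []  # textures currently in this heap

    def used(self):
        return sum(t.size for t in self.resident)

    def alloc(self, tex, age_limit):
        """Place tex, kicking only textures with last_used < age_limit.

        Returns the list of kicked textures. On failure the heap is left
        untouched and AllocFailed is raised; the caller can keep the
        texture in GART memory and retry later.
        """
        need = tex.size - (self.capacity - self.used())
        victims = []
        # Least recently used first, as in the existing LRU strategy.
        for cand in sorted(self.resident, key=lambda t: t.last_used):
            if need <= 0:
                break
            if cand.last_used >= age_limit:
                break  # every remaining texture is too recent to kick
            victims.append(cand)
            need -= cand.size
        if need > 0:
            raise AllocFailed(tex.name)
        for v in victims:
            self.resident.remove(v)
        self.resident.append(tex)
        return victims
```

For a normal upload the caller would pass the current heap age as age_limit, so anything not used at this very moment may be kicked. For a move from GART to local memory it would pass an older age, so recently used textures stay put.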
If you try to move textures back to local memory each time a texture is
used, this would result in some kind of automatic regulation of heap
usage. By kicking only textures that are several frames old in this
process, you'd avoid thrashing.
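The self-regulating behaviour can be sketched roughly as follows. This is a Python illustration under simplifying assumptions: heaps are plain dictionaries, ages are counted in frames, GART capacity is not checked, and a kicked texture is simply dropped (it would be uploaded again on its next use). The function name use_texture and the parameter min_idle_frames are hypothetical.

```python
def use_texture(name, size, frame, local, gart, local_capacity,
                min_idle_frames=2):
    """Record a use of `name` in `frame` and return the heap it ends up in.

    local and gart map texture name -> (size, last_used_frame).
    """
    if name in local:
        local[name] = (size, frame)
        return "local"

    # Not in local memory: try to move (or upload) it there, but only kick
    # textures that have been idle for at least min_idle_frames frames.
    free = local_capacity - sum(s for s, _ in local.values())
    victims = []
    oldest_first = sorted(local.items(), key=lambda kv: kv[1][1])
    for vname, (vsize, vlast) in oldest_first:
        if free >= size or frame - vlast < min_idle_frames:
            break
        victims.append(vname)
        free += vsize

    if free < size:
        # Not enough old textures to kick: stay in GART and retry next use.
        gart[name] = (size, frame)
        return "gart"

    for vname in victims:
        del local[vname]
    gart.pop(name, None)
    local[name] = (size, frame)
    return "local"
```

Calling this on every texture use means frequently used textures drift into local memory as soon as older ones fall idle, while a working set that does not fit simply leaves the overflow in GART memory instead of evicting textures that are still in use.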
Currently the texture heap age is only incremented on lock contention
(IIRC). In this scheme you'd also increment it on buffer swaps and
remember the texture heap ages of the last two buffer swaps.
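The bookkeeping for this could look roughly like the sketch below (Python, for illustration only). It assumes a texture is stamped with the current heap age whenever it is used, so a texture stamped before the age recorded at the second-to-last swap was used in neither the current nor the previous frame. The class and method names are made up.

```python
from collections import deque


class HeapAge:
    def __init__(self):
        self.age = 0
        # Heap ages recorded at the last two buffer swaps, oldest first.
        self.swap_ages = deque([0, 0], maxlen=2)

    def on_lock_contention(self):
        # Existing behaviour: another client may have touched the heap.
        self.age += 1

    def on_buffer_swap(self):
        # Proposed addition: a new frame starts, so bump the age and
        # remember it; the deque drops the oldest entry automatically.
        self.age += 1
        self.swap_ages.append(self.age)

    def kick_limit(self):
        """Textures last used before this age are older than the previous frame."""
        return self.swap_ages[0]
```

An allocator that refuses to kick anything used in the current or previous frame would compare each texture's last-used age against kick_limit().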
> Some numbers just for fun, since those are the numbers everyone wants to
> Some other OS, rtcw: 120 fps
> Some other OS, q3: 137 fps (this one is a bit cheated. I'm pretty sure
> non-fullscreen does not use pageflip. Fullscreen score was 174 fps,
> whereas we only improved from 125 fps to 129 fps...)
> This ain't that bad. I'd be happy if we'd do that well in say, ut2k4 or
| Felix Kühling <fxkuehl@...> http://fxk.de.vu |
| PGP Fingerprint: 6A3C 9566 5B30 DDED 73C3 B152 151C 5CC1 D888 E595 |