From: Roland S. <rsc...@hi...> - 2005-02-09 19:58:15
|
Some more numbers, this time from a 9000Pro (64MB). In contrast to the quite slow 7200sdr with only 2.6GB/s ram, this one has 8.8GB/s bandwidth (128bit/275MHz DDR). Not to mention the chip is certainly faster too. Test system is also faster though, A64 3000+ socket 754, 3.2GB/s system memory bandwidth.

The hacks required to disable the heaps are exactly the same as those used on r100 (except of course the nr_heaps assertion had to go..., yes this hack DOES break the client stuff). Local texture size is 35MB, unless otherwise noted (I just changed the allocation scheme in the ddx driver, so only 1 framebuffer worth of pixmap cache is used instead of 3, btw without really any noticeable impact on 2d performance). GART size is 32MB unless specifically stated.

Desktop resolution 1280x1024, quake3 windowed 1024x768.

AGP 4x, local: 125 fps
AGP 4x, both: 80-123 fps
AGP 4x, gart only: 68 fps
AGP 1x, local: 115 fps
AGP 1x, gart only: 21 fps

Some rtcw (demo checkpoint) results too (fullscreen 1024x768).

AGP 4x, local: 70 fps
AGP 4x, local 45MB: 85 fps
AGP 4x, both: 62-77 fps
AGP 4x, both, gart 64MB: 58-68 fps
AGP 4x, gart only, 64MB: 47 fps
AGP 1x, local: 56 fps
AGP 1x, gart only: 14 fps

texdown AGP 4x gart: 230MB/s
texdown AGP 4x local: 650MB/s!
texdown AGP 1x gart: 117MB/s before q3, 89MB/s after q3 (?)
texdown AGP 1x local: 265MB/s!!!

There seemed to be a problem with gart texturing and AGP modes lower than 4x: agpgart reported "Putting AGP V2 device at 0000:00:00.0 into 0x mode", while glxinfo still reported AGP 1x and 2x, respectively. 1x and 2x results were identical, and to put it simply, the results were downright appalling. This may be a problem with agpgart (using the version from kernel 2.6.10). However, I was amazed at the texdown performance to local graphics memory, as it's VERY close to the theoretical limit. texdown performance with AGP 4x was also quite good. 
The rtcw checkpoint demo exceeds an "in-use" texture size of 35MB, that's why I've put in some results with a larger local texture size (as well as increased gart size). 45MB is enough though; with 35MB you'd get some occasional drops to around 12fps (and 6fps with agp 1x), and these are completely gone with 45MB.

Performance with gart texturing, even in 4x mode, takes a big hit (almost 50%).

I was not really able to get consistent performance results when both texture heaps were active; I guess it's luck of the day which textures got put in the gart heap and which ones in the local heap. But the fact that performance got faster with a smaller gart heap is not a good sign. And even though the maximum obtained in rtcw with a 35MB local heap and 29MB gart heap was higher than the score obtained with the 35MB local heap alone, there were clearly areas which ran faster with only the local heap.

It seems to me that the allocator really should try harder to use the local heap to be useful on r200 cards. Moreover, when you DO have to put textures into the gart heap, it is likely you'd get quite a bit better performance if you revisit that later, when more space becomes available on the local heap, and upload the still-used textures from the gart heap to the local heap (in fact, that should be even faster than those 650MB/s, since no in-kernel copy would be needed; it should be possible to blit it directly).

Some numbers just for fun, since those are the numbers everyone wants to see...

Some other OS, rtcw: 120 fps
Some other OS, q3: 137 fps (this one is a bit cheated. I'm pretty sure non-fullscreen does not use pageflip. Fullscreen score was 174 fps, whereas we only improved from 125 fps to 129 fps...)

This ain't that bad. I'd be happy if we'd do that well in say, ut2k4 or doom3...

Roland |
From: Jon S. <jon...@gm...> - 2005-02-09 20:13:49
|
Is there a tool for dumping stats on which textures are in which heap? -- Jon Smirl jon...@gm... |
From: Felix <fx...@gm...> - 2005-02-09 21:10:40
|
Am Mittwoch, den 09.02.2005, 20:58 +0100 schrieb Roland Scheidegger:
> Some more numbers, this time from a 9000Pro (64MB). [snip]
>
> AGP 4x, local: 125 fps
> AGP 4x, both: 80-123 fps
> AGP 4x, gart only: 68 fps
> AGP 1x, local: 115 fps
> AGP 1x, gart only: 21 fps
>
> Some rtcw (demo checkpoint) results too (fullscreen 1024x768).
> AGP 4x, local: 70 fps
> AGP 4x, local 45MB: 85 fps
> AGP 4x, both: 62-77 fps
> AGP 4x, both, gart 64MB: 58-68 fps
> AGP 4x, gart only, 64MB: 47 fps
> AGP 1x, local: 56 fps
> AGP 1x, gart only: 14 fps

Thanks for these numbers. They show that the current memory management strategies are far from perfect. Read on below for some ideas how to improve it.

> texdown AGP 4x gart: 230MB/s
> texdown AGP 4x local: 650MB/s!
> texdown AGP 1x gart: 117MB/s before q3, 89MB/s after q3 (?)
> texdown AGP 1x local: 265MB/s!!!
>
> [snip] However, I was amazed at the texdown performance to local
> graphics memory, as it's VERY close to the theoretical limit.
> texdown performance with AGP 4x was also quite good.

Keith committed a fastpath for Mesa's texstore functions that reduced the CPU overhead of the rgba 32bit texture uploads significantly.

> [snip]
> It seems to me that the allocator really should try harder to use the
> local heap to be useful on r200 cards, moreover it is likely that you'd
> get quite a bit better performance when you DO have to put textures into
> the gart heap when you revisit that later when more space becomes
> available on the local heap and upload the still-used textures from the
> gart heap to the local heap (in fact, should be even faster than those
> 650MB/s, since no in-kernel-copy would be needed, it should be possible
> to blit it directly).

The big problem with the current texture allocator is that it can't tell which areas are really unused. Texture space is only allocated and never freed. Once the memory is "full" it starts kicking textures to upload new ones. This is the only way of "freeing" memory. Using an LRU strategy it has a good chance of kicking unused textures first, but there's no guarantee. It can't tell if a kicked texture will be needed the next instant. So trying to move textures from GART to local memory would basically mean that you blindly kick the least recently used texture(s) from local memory. If those textures are needed again soon then performance is going to suffer badly.

Therefore I'm proposing a modified allocator that fails when it needs to start kicking too recently used textures (e.g. textures used in the current or previous frame). Failure would not be fatal in this case, you just keep the texture in GART memory and try again later. Actually you could use the same allocator for normal texture uploads. Just specify the current texture heap age as the limit.

If you try to move textures back to local memory each time a texture is used, this would result in some kind of automatic regulation of heap usage. By kicking only textures that are several frames old in this process, you'd avoid thrashing.

Currently the texture heap age is only incremented on lock contention (IIRC). In this scheme you'd also increment it on buffer swaps and remember the texture heap ages of the last two buffer swaps.

[snip]

Regards,
Felix

--
| Felix Kühling <fx...@gm...> http://fxk.de.vu |
| PGP Fingerprint: 6A3C 9566 5B30 DDED 73C3 B152 151C 5CC1 D888 E595 | |
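[Editorial note: the age-limited allocator described above can be sketched as a toy model. This is purely illustrative — the real allocator is C code in texmem.c, and the names (`AgeLimitedHeap`, `min_kick_age`, `swap_buffers`) are invented for the sketch. Only the policy follows the proposal: kick LRU textures to make room, but fail non-fatally instead of kicking anything used since a given heap age.]

```python
from collections import OrderedDict

class AgeLimitedHeap:
    """Toy model of the proposed allocator: an LRU texture heap whose
    allocator refuses to kick textures newer than an age cutoff."""

    def __init__(self, size):
        self.size = size
        self.free = size
        self.textures = OrderedDict()  # name -> (size, last_used_age), LRU first
        self.age = 0                   # incremented on buffer swaps

    def swap_buffers(self):
        self.age += 1

    def use(self, name):
        # driUpdateTexLRU equivalent: move to the MRU end, stamp current age
        size, _ = self.textures.pop(name)
        self.textures[name] = (size, self.age)

    def allocate(self, name, size, min_kick_age):
        """Try to make room by kicking only textures last used before
        min_kick_age. Return False (non-fatal) if that isn't enough;
        the caller keeps the texture in the GART heap and retries later."""
        while self.free < size and self.textures:
            lru_name, (lru_size, lru_age) = next(iter(self.textures.items()))
            if lru_age >= min_kick_age:
                return False  # would have to kick a too-recently-used texture
            del self.textures[lru_name]
            self.free += lru_size
        if self.free < size:
            return False
        self.textures[name] = (size, self.age)
        self.free -= size
        return True
```

With a 100-unit heap, a second 60-unit texture is refused while the first is still current-frame, but succeeds once two buffer swaps have aged the first texture past the cutoff.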
From: Keith W. <ke...@tu...> - 2005-02-09 21:59:19
|
Felix Kühling wrote:
> Am Mittwoch, den 09.02.2005, 20:58 +0100 schrieb Roland Scheidegger:
>> [snip benchmark numbers]
>>
>> However, I was amazed at the texdown performance to local graphics
>> memory, as it's VERY close to the theoretical limit. texdown
>> performance with AGP 4x was also quite good.
>
> Keith committed a fastpath for Mesa's texstore functions that reduced
> the CPU-overhead of the rgba 32bit texture uploads significantly.

I think the radeon actually uses a rgba8888 internal format, unlike the argb8888 everything else does (which was the subject of the commit). That means mesa will upload GL_RGBA textures with a straight memcpy, though it still hits a very slow path with "near miss" texture formats.

Keith |
From: Felix <fx...@gm...> - 2005-02-10 17:55:11
Attachments:
staletex.diff
|
Am Mittwoch, den 09.02.2005, 22:12 +0100 schrieb Felix Kühling:
> Am Mittwoch, den 09.02.2005, 20:58 +0100 schrieb Roland Scheidegger:
[snip]
> > Performance with gart texturing, even in 4x mode, takes a big hit
> > (almost 50%). [snip]
> > It seems to me that the allocator really should try harder to use the
> > local heap to be useful on r200 cards [snip]
>
> The big problem with the current texture allocator is that it can't tell
> which areas are really unused. [snip]
>
> Therefore I'm proposing a modified allocator that fails when it needs to
> start kicking too recently used textures (e.g. textures used in the
> current or previous frame). [snip]

I simplified this idea a little further and attached a patch against texmem.[ch]. It frees stale textures (and also place holders for other clients' textures) that haven't been used in 1 second when it runs out of space on a texture heap. This way it will try a bit harder to put textures into the first heap before using the second heap, without much risk (I hope) of performance regressions.

I tested this on a ProSavageDDR where rendering speed appears to be the same with local and GART textures. There was no measurable performance regression in Quake3 and I noticed no subjective performance regression in Torcs or Quake1 either.

Now the only thing missing in texmem.c for migrating textures from GART to local memory would be a flag to driAllocateTexture to stop trying if kicking stale textures didn't free up enough space (on the first texture heap).

Anyway, I think the attached patch should already make a difference as it is. I'd be interested how much it improves your performance numbers with Quake3 and rtcw on r200 when both texture heaps are enabled.

[snip]

Regards,
Felix

--
| Felix Kühling <fx...@gm...> http://fxk.de.vu |
| PGP Fingerprint: 6A3C 9566 5B30 DDED 73C3 B152 151C 5CC1 D888 E595 | |
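[Editorial note: the stale-texture freeing described in this mail can be sketched as follows. This is an illustrative toy model, not the actual staletex.diff (which patches the C code in texmem.[ch]); the class and method names are invented, and only the policy matches the description — when an allocation fails, free textures unused for 1 second before falling back to the next heap.]

```python
import time

class StaleTexHeap:
    STALE_SECONDS = 1.0  # the 1-second threshold from the patch description

    def __init__(self, size):
        self.size = size
        self.free = size
        self.textures = {}  # name -> (size, last-used wall-clock time)

    def use(self, name):
        size, _ = self.textures[name]
        self.textures[name] = (size, time.monotonic())

    def free_stale(self, now=None):
        """Free textures that haven't been used for STALE_SECONDS."""
        now = time.monotonic() if now is None else now
        for name, (sz, last) in list(self.textures.items()):
            if now - last > self.STALE_SECONDS:
                del self.textures[name]
                self.free += sz

    def allocate(self, name, size):
        if self.free < size:
            self.free_stale()          # second chance: drop stale textures
        if self.free < size:
            return False               # caller falls back to the next heap
        self.textures[name] = (size, time.monotonic())
        self.free -= size
        return True
```

An allocation that fails while everything is fresh succeeds once the old texture has been idle past the threshold, so the first heap is reused harder before the second heap is touched.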
From: Jon S. <jon...@gm...> - 2005-02-10 20:33:31
|
I haven't looked at the texture heap management code, but one simple idea for heap management would be to cascade the on-board heap to the AGP one. How does the current algorithm work? Does an algorithm like the one below have merit? It should sort the hot textures on-board, and single use textures should fall out of the cache. 1) load all textures initially in the on-board heap. Since if you are loading them you're probably going to use them. 2) Do LRU with the on-board heap. 3) When you run out of space on-board, demote the end of the LRU list to the top of the AGP heap and copy the texture between heaps. 4) Run LRU on the AGP heap. 5) When it runs out of space lose the item. 6) an added twist would be if the top of the AGP heap gets hit too often knock it out of cache so that it will get reloaded on-board. Jon Smirl jon...@gm... |
From: Felix <fx...@gm...> - 2005-02-10 22:11:22
|
Am Donnerstag, den 10.02.2005, 15:31 -0500 schrieb Jon Smirl:
> I haven't looked at the texture heap management code, but one simple
> idea for heap management would be to cascade the on-board heap to the
> AGP one. How does the current algorithm work? Does an algorithm like
> the one below have merit? It should sort the hot textures on-board,
> and single use textures should fall out of the cache.
>
> 1) load all textures initially in the on-board heap. Since if you are
> loading them you're probably going to use them.

Drivers usually upload textures to the hardware just before binding them to a hardware texture unit. So this assumption is always true.

> 2) Do LRU with the on-board heap.
> 3) When you run out of space on-board, demote the end of the LRU list
> to the top of the AGP heap and copy the texture between heaps.

This means you copy a texture when you don't know if or when you're going to need it again. So the move of the texture may just be a waste of time. It would be better to just kick the texture and upload it again later when it's really needed.

> 4) Run LRU on the AGP heap.
> 5) When it runs out of space lose the item.
> 6) an added twist would be if the top of the AGP heap gets hit too
> often knock it out of cache so that it will get reloaded on-board.

I'd rather reverse your scheme. Upload a texture to the GART heap first, because that's potentially faster (though not with the current implementation in the radeon drivers). When the texture is needed more frequently, try promoting it to the local texture heap.

This scheme would give good results with movie players that need fast texture uploads and typically use each texture exactly once. It would also improve performance with games, simulations, ... that tend to use the same textures many times and benefit from the higher memory bandwidth when accessing local textures.

--
| Felix Kühling <fx...@gm...> http://fxk.de.vu |
| PGP Fingerprint: 6A3C 9566 5B30 DDED 73C3 B152 151C 5CC1 D888 E595 | |
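[Editorial note: Felix's reversed scheme can be sketched the same way. Illustrative only — the class, the `PROMOTE_AFTER` threshold value, and the string return codes are all made up; the policy follows the proposal (and its refinement later in the thread): new textures land in GART, frequently bound textures are promoted to local, and updating a texture resets its count so movie-player-style textures stay in GART.]

```python
from collections import OrderedDict

class PromotingHeaps:
    """Toy model of the reversed scheme: upload to GART first, promote
    a texture to the local heap once it has been bound often enough."""

    PROMOTE_AFTER = 3  # invented threshold; a real driver would tune this

    def __init__(self, local_slots):
        self.gart = {}             # name -> bind count since last update
        self.local = OrderedDict() # LRU order: oldest first
        self.local_slots = local_slots

    def bind(self, tex):
        # The driUpdateTexLRU hook would be the natural place to count uses.
        if tex in self.local:
            self.local.move_to_end(tex)
            return "local"
        count = self.gart.get(tex, 0) + 1
        self.gart[tex] = count
        if count >= self.PROMOTE_AFTER and len(self.local) < self.local_slots:
            del self.gart[tex]
            self.local[tex] = True
            return "promoted"
        return "gart"

    def update(self, tex):
        # Frequently-updated textures (e.g. movie frames via
        # glTexSubImage2D) should stay in GART: reset the usage count.
        if tex in self.gart:
            self.gart[tex] = 0
```

A texture bound three times gets promoted, while one that is rewritten between binds never accumulates enough uses and stays in the fast-upload GART heap.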
From: Jon S. <jon...@gm...> - 2005-02-10 22:36:22
|
On Thu, 10 Feb 2005 23:13:30 +0100, Felix Kühling <fx...@gm...> wrote:
> This means you copy a texture when you don't know if or when you're
> going to need it again. So the move of the texture may just be a waste
> of time. It would be better to just kick the texture and upload it again
> later when it's really needed.

I suspect this extra texture copy wouldn't be noticeable except when you construct a test program which artificially triggers it. Most games will achieve a steady state with their loaded textures after a frame or two and the copies will stop.

> I'd rather reverse your scheme. Upload a texture to the GART heap first,
> because that's potentially faster (though not with the current
> implementation in the radeon drivers). When the texture is needed more
> frequently, try promoting it to the local texture heap.

I thought about this, but there is no automatic way to figure out when to promote from GART to local. Same problem when local overflows: what do you demote to AGP? You still have copies with this scheme too.

Going first to local and then demoting to AGP sorts everything automatically. It may cause a little more churn in the heaps, but the advantage is that the algorithm is very simple and doesn't need much tuning. The only tunable parameter is determining when the top of the AGP heap is "hot" and booting it. You could use something simple like boot after 500 accesses.

--
Jon Smirl
jon...@gm... |
From: Felix <fx...@gm...> - 2005-02-11 00:06:52
|
Am Donnerstag, den 10.02.2005, 17:40 -0500 schrieb Jon Smirl:
> On Thu, 10 Feb 2005 23:13:30 +0100, Felix Kühling <fx...@gm...> wrote:
> > This scheme would give good results with movie players that need fast
> > texture uploads and typically use each texture exactly once. It would
>
> Movie players aren't even close to being texture bandwidth bound.

That's not my experience. Optimizations in the texture upload path, using the AGP heap and partial texture uploads had a big impact on mplayer -vo gl performance on my ProSavageDDR (factor 2-3 all of them taken together).

> The demote from local to AGP scheme would cause two copies on each frame
> but there is plenty of bandwidth. But this assumes that the movie
> player creates a new texture for each frame.
>
> A better scheme for a movie player would be to create a single texture
> and then keep replacing its contents.

You're right, that's what actually happens in mplayer. It uses glTexSubImage2D because it typically changes only a part of a texture with power-of-two dimensions.

> Or use two textures and double
> buffer. But once created these textures would not move in the LRU list
> unless you started something like a game in another window.

Yes, they would move in the LRU list. That's why it's called "least recently used" not "least recently created". ;-) So I would have to modify my scheme to reset the usage count/frequency when a texture image is changed, such that a texture that is updated very frequently would not be promoted to local memory.

Am Donnerstag, den 10.02.2005, 17:34 -0500 schrieb Jon Smirl:
> On Thu, 10 Feb 2005 23:13:30 +0100, Felix Kühling <fx...@gm...> wrote:
> > This means you copy a texture when you don't know if or when you're
> > going to need it again. So the move of the texture may just be a waste
> > of time. It would be better to just kick the texture and upload it again
> > later when it's really needed.
>
> I suspect this extra texture copy wouldn't be noticeable except when
> you construct a test program which artificially triggers it. Most
> games will achieve a steady state with their loaded textures after a
> frame or two and the copies will stop.

Still, this copy is unnecessary at the time. Delaying the re-upload to the time when the texture is needed again has only advantages and is not difficult to implement.

> I thought about this, but there is no automatic way to figure out when
> to promote from GART to local.

Yes there is. In the current scheme, whenever a texture is bound to a hardware tex unit the driver calls driUpdateTexLRU, which moves the texture to the front of the LRU list. In this function you could easily count how often or how frequently a texture has been used. Based on this information and maybe the texture size you could decide which textures to promote and when. You will keep promoting textures until the local heap is full of non-stale textures.

> Same problem when local overflows, what
> do you demote to AGP? You still have copies with this scheme too.

Textures are sorted in LRU-order on the texture heaps. So you always kick least recently used textures first. It has always worked like this even in the current scheme. For promoting textures I would only kick stale textures from the local heap.

> Going first to local and then demoting to AGP sorts everything
> automatically. It may cause a little more churn in the heaps,

In my experience texture uploads are quite expensive. So IMO avoiding unnecessary texture uploads or copies should have a high priority.

> but the
> advantage is that the algorithm is very simple and doesn't need much
> tuning. The only tunable parameter is determining when the top of the
> AGP heap is "hot" and booting it. You could use something simple like
> boot after 500 accesses.

I don't think my algorithm is much more complicated. It can be implemented by gradual improvements of the current algorithm (freeing stale texture memory is one step), which helps avoid unexpected performance regressions. At the moment I'm not planning to rewrite it from scratch, especially because I can't test on any hardware where I can actually measure great performance improvements ATM.

The only tunable parameter in my algorithm is how often/frequently used a texture must be in order to try to promote it to the local texture heap. Maybe there are a few more degrees of freedom, because you can also consider the texture size for promotion. I think the steady state result would be about the same as with your algorithm, but I expect my scheme to work better when textures are used very infrequently or updated very frequently (movie players). In particular this would make the texture_heaps option unnecessary, which is a good thing IMO (good performance without tuning is good for Joe Average User).

Anyway, anyone is free to implement an alternative algorithm for comparison. If it works better, then it will be adopted. However, I'm not convinced your algorithm is going to work better than mine (you asked for my opinion, didn't you), so I'm not going to implement it.

--
| Felix Kühling <fx...@gm...> http://fxk.de.vu |
| PGP Fingerprint: 6A3C 9566 5B30 DDED 73C3 B152 151C 5CC1 D888 E595 | |
From: Roland S. <rsc...@hi...> - 2005-02-11 00:50:24
|
Felix Kühling wrote:
> I don't think my algorithm is much more complicated. It can be
> implemented by gradual improvements of the current algorithm (freeing
> stale texture memory is one step) which helps avoiding unexpected
> performance regressions. At the moment I'm not planning to rewrite it
> from scratch, especially because I can't test on any hardware where I
> can actually measure great performance improvements ATM.

I'm not sure what a really good implementation would look like, but you could try lowering gart speed to 1x with a savage to see a performance difference between local and gart texturing. Though I'm not convinced the savages are actually fast enough to even take a hit with agp 1x...

Roland |
From: Jon S. <jon...@gm...> - 2005-02-10 22:41:41
|
On Thu, 10 Feb 2005 23:13:30 +0100, Felix Kühling <fx...@gm...> wrote:
> This scheme would give good results with movie players that need fast
> texture uploads and typically use each texture exactly once. It would

Movie players aren't even close to being texture bandwidth bound. The demote from local to AGP scheme would cause two copies on each frame but there is plenty of bandwidth. But this assumes that the movie player creates a new texture for each frame.

A better scheme for a movie player would be to create a single texture and then keep replacing its contents. Or use two textures and double buffer. But once created these textures would not move in the LRU list unless you started something like a game in another window.

--
Jon Smirl
jon...@gm... |
From: Dave A. <ai...@li...> - 2005-02-11 00:12:16
|
> A better scheme for a movie player would be to create a single texture
> and then keep replacing its contents. Or use two textures and double
> buffer. But once created these textures would not move in the LRU list
> unless you started something like a game in another window.

It would help if we supported that in any reasonable fashion (at least on radeon/r200). Movie players are very texture upload bound, well at least on my embedded system. I do a lot of animation with movies, and mngs and arrays of pngs, and most of my time is spent in memcpy and texstore_rgba8888. This is a real pain for me, and I'm slowly gathering enough knowledge to do a great big hack for my own internal use.

Dave.

--
David Airlie, Software Engineer
http://www.skynet.ie/~airlied / airlied at skynet.ie
pam_smb / Linux DECstation / Linux VAX / ILUG person |
From: Jon S. <jon...@gm...> - 2005-02-11 00:35:39
|
AGP 8x should just be able to keep up with 1280x1024x24b 60 times/sec.

How does mesa access AGP memory from the CPU side? AGP memory is system memory which the AGP aperture makes visible to the GPU. Are we using the GPU to load textures into AGP memory, or is it being done entirely on the main CPU with a memcpy?

For things like a movie player we should even be able to give it a pointer to the texture in system memory (AGP space) and let it directly manipulate the texture buffer. Doing that would require playing with the page tables to preserve protection.

--
Jon Smirl
jon...@gm... |
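[Editorial note: the bandwidth claim above is easy to sanity-check. The arithmetic below uses the common theoretical AGP peak of ~266 MB/s for 1x (66 MHz x 32 bit), doubling per mode; real-world throughput, as Roland's texdown numbers elsewhere in the thread show, is considerably lower, especially at 1x.]

```python
# Back-of-the-envelope check: bandwidth needed for full-frame
# 24-bit texture uploads at 1280x1024, 60 times per second.
width, height, bytes_per_pixel, fps = 1280, 1024, 3, 60

required = width * height * bytes_per_pixel * fps  # bytes per second
required_mb = required / (1024 * 1024)

# Theoretical AGP peak rates: 1x is ~266 MB/s, each mode doubles it.
agp_peak_mb = {mode: 266 * mode for mode in (1, 2, 4, 8)}

print(f"required: {required_mb:.0f} MB/s")
for mode, peak in agp_peak_mb.items():
    print(f"AGP {mode}x peak: {peak} MB/s")
```

The requirement works out to 225 MB/s, so on paper even AGP 1x is in range and 8x has ample headroom; in practice the measured 650 MB/s at 4x covers it comfortably, while the ~117 MB/s measured at 1x does not.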
From: Roland S. <rsc...@hi...> - 2005-02-11 00:47:53
|
Jon Smirl wrote:
> AGP 8x should just be able to keep up with 1280x1024x24b 60
> times/sec.

AGP 4x should be enough. Remember I got 600MB/s max throughput. Not with 24bit textures though, the Mesa RGBA-BGRA conversion takes WAY too much time to achieve that.

> How does mesa access AGP memory from the CPU side? AGP memory is
> system memory which the AGP makes visible to the GPU. Are we using
> the GPU to load textures into AGP memory or is it being done entirely
> on the main CPU with a memcopy?

Depends on the driver. radeon/r200 use a gpu blit. Might be suboptimal, but at least it handles things like tiling (when the gpu blitter can do it) automatically. I'm not sure, but couldn't the radeon blitter actually do rgba-bgra conversion too, for instance?

> For things like a movie player we should even be able to give it a
> pointer to the texture in system memory (AGP space) and let it
> directly manipulate the texture buffer. Doing that would require
> playing with the page tables to preserve protection.

This seems to be exactly what the client extension of the r200 driver is intended for. But for normal apps it's useless (and for the most part even for apps which could make good use of it, since it's an extension almost no one uses anyway).

Roland |
From: Roland S. <rsc...@hi...> - 2005-02-11 00:18:57
|
Felix Kühling wrote:
> I simplified this idea a little further and attached a patch against
> texmem.[ch]. It frees stale textures (and also place holders for other
> clients' textures) that haven't been used in 1 second when it runs out
> of space on a texture heap. [snip]
>
> Anyway, I think the attached patch should already make a difference as
> it is. I'd be interested how much it improves your performance numbers
> with Quake3 and rtcw on r200 when both texture heaps are enabled.

I've done a couple of benchmarks. All results are "fglrx-boosted", so to speak (too lazy to reboot).

q3, local 45MB or 35MB: 145 fps
rtcw, local 45MB: 95 fps
rtcw, local 35MB: 76 fps

with both heaps, local size 35MB, GART texture size 61MB:

q3, old allocator: 105-125 fps
rtcw, old allocator: 70-84 fps
q3, new allocator: 108-126 fps
rtcw, new allocator: 71-85 fps

This does not seem to really make a difference.

One interesting thing I noticed though is that it is actually not really a "range" of results, but only some distinct values. For rtcw, the scores were always very close to either 70, 77 or 85 fps (within 1 frame); out of 10 runs maybe 6 were around 77, 2 around 70 and 2 around 85. Quake3 mostly ran at around 125 fps but once every while was just below 110.

Roland |
From: Owen T. <ot...@re...> - 2005-02-11 02:56:10
|
Dave Airlie wrote:
>> A better scheme for a movie player would be to create a single texture
>> and then keep replacing its contents. Or use two textures and double
>> buffer. But once created these textures would not move in the LRU list
>> unless you started something like a game in another window.
>
> if we supported that in any reasonable fashion (at least on radeon/r200).
> Movie players are very texture upload bound; well, at least on my embedded
> system, I do a lot of animation with movies, and mngs and arrays of pngs,
> and most of my time is spent in memcpy and texstore_rgba8888. This is a
> real pain for me, and I'm slowly gathering enough knowledge to do a great
> big hack for my own internal use.

Perhaps a wild idea ... does APPLE_client_storage do what you want? If so,
then it might be a lot simpler and more reusable to test/optimize/fix up
that than to start from scratch.

That should allow a straight copy from data you create to memory the card
can texture from, which is about as good as possible.

For subimage modification, the spec seems to permit modifying the data in
place and then calling TexSubImage on the subregion, with a pointer into
the original data, to notify of the change.

Regards,
Owen
|
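[Editor's note] The pattern Owen describes looks roughly like the sketch below. It is illustrative only and not runnable as-is (it assumes a current GL context, a driver exposing GL_APPLE_client_storage, and hypothetical `W`, `H`, `tex`, and `decode_next_frame` names); the thread itself contains no code.

```c
/* Sketch: texture a movie frame straight from app-owned memory. */
GLubyte *frame = malloc(W * H * 4);        /* app-owned frame buffer */

glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, W, H, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, frame);   /* GL keeps our pointer,
                                                     no internal copy */

/* Per movie frame: decode in place, then tell GL which region changed.
 * With client storage the driver can texture from (or DMA out of)
 * 'frame' directly instead of going through its own staging copy. */
decode_next_frame(frame);                  /* hypothetical decoder call */
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H,
                GL_RGBA, GL_UNSIGNED_BYTE, frame);
```

This would eliminate exactly the memcpy/texstore_rgba8888 copies Dave complains about, provided the driver honors the extension rather than copying anyway.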
From: Jon S. <jon...@gm...> - 2005-02-11 03:23:39
|
On Thu, 10 Feb 2005 21:59:29 -0500, Owen Taylor <ot...@re...> wrote:
> That should allow a straight copy from data you create to memory the card
> can texture from, which is about as good as possible.

If you have a big AGP aperture to play with, there is a faster way. When
you get the call to copy the texture from user space, don't copy it.
Instead, mark its page table entries as copy-on-write. Get the physical
address of each page and set it into the GART. Now the GPU can get to the
texture with zero copies. When you are done with it, check whether the app
caused a copy-on-write; if so, free the page, else just remove the COW
flag.

-- 
Jon Smirl
jon...@gm...
|
From: Eric A. <et...@lc...> - 2005-02-11 04:15:00
|
On Thu, 2005-02-10 at 22:23 -0500, Jon Smirl wrote:
> On Thu, 10 Feb 2005 21:59:29 -0500, Owen Taylor <ot...@re...> wrote:
> > That should allow a straight copy from data you create to memory the
> > card can texture from, which is about as good as possible.
>
> If you have a big AGP aperture to play with, there is a faster way.
> When you get the call to copy the texture from user space, don't copy
> it. Instead, mark its page table entries as copy-on-write. Get the
> physical address of the page and set it into the GART. Now the GPU can
> get to it with zero copies. When you are done with it, check and see
> if the app caused a copy-on-write; if so, free the page, else just
> remove the COW flag.

Is there evidence that this is/would be in fact faster?

-- 
Eric Anholt                                     et...@lc...
http://people.freebsd.org/~anholt/          anholt@FreeBSD.org
|
From: Dave A. <ai...@li...> - 2005-02-11 04:51:34
|
> > it. Instead, mark its page table entries as copy-on-write. Get the
> > physical address of the page and set it into the GART. Now the GPU can
> > get to it with zero copies. When you are done with it, check and see
> > if the app caused a copy-on-write; if so, free the page, else just
> > remove the COW flag.
>
> Is there evidence that this is/would be in fact faster?

No, but I could practically guarantee anything is faster than the 3-4
copies a radeon texture goes through at the moment.

Dave.

-- 
David Airlie, Software Engineer
http://www.skynet.ie/~airlied / airlied at skynet.ie
pam_smb / Linux DECstation / Linux VAX / ILUG person
|
From: Jon S. <jon...@gm...> - 2005-02-11 05:23:29
|
On Thu, 10 Feb 2005 20:14:00 -0800, Eric Anholt <et...@lc...> wrote:
> Is there evidence that this is/would be in fact faster?

That's how the networking drivers work, and they may be the fastest
drivers in the system. But it has not been coded for AGP, so nobody knows
for sure. It has to be faster, though: having the CPU do the copy will
cause the TLB cache to be flushed as you walk through all of the pages.
Having the GPU do the copy is even worse, since it moves across AGP.

We have bigger problems to chase. Plus, implementing it this way probably
has a bunch of architecture-specific problems I don't know about. But I'm
sure it would work on x86. After we get X on GL up on mesa-solo, I can
look at changing the texture copy code.

-- 
Jon Smirl
jon...@gm...
|