From: Thomas H. <th...@tu...> - 2008-02-28 09:09:15
Eric Anholt wrote:
> On Thu, 2008-02-28 at 06:08 +1000, Dave Airlie wrote:
>>> I wasn't planning on a Mesa 7.1 (trunk code) release for a while, but I could finish up 7.0.3 at any moment. I have to admit that I haven't actually tested Mesa 7.0.3 with current X code in quite a while though.
>>>
>>> Before Mesa 7.1 I'd like to see a new, official DRM release. Otherwise, it's hard to identify a snapshot of DRM that works with Mesa. I know I always have trouble with DRM versioning otherwise.
>>>
>>> Is there any kind of roadmap for a new DRM release?
>>
>> When TTM hits the kernel, I'll release a libdrm to work with that and solidify the API,
>>
>> however people keep finding apparently valid reasons to pick holes in the TTM API, however I haven't seen the discussion brought up in the few weeks since.
>
> http://cgit.freedesktop.org/~anholt/drm/log/?h=drm-ttm-cleanup-2
>
> has some I believe obvious cleanups to the API, removing many sharp edges. At that point the BO parts of the API are more or less tolerable to me. The fencing code I don't understand and am very scared by still, but most of it has left the user <-> kernel API at least.

Some important comments about the API changes, starting from below.

Remove DRM_BO_FLAG_FORCE_MAPPABLE: Yes, that can go away.

Remove DRM_BO_HINT_WAIT_LAZY: No. This flag is intended for polling-only hardware, and has no use at all in the intel driver once the sync flushes are gone. The reason you ever saw a difference with this flag is that there was a bug in the execbuf code that caused you to hit a polling path in the fence wait mechanism.

Ignore DRM_FENCE_FLAG_WAIT_LAZY: No. Same as above.

Remove unused DRM_FENCE_FLAG_WAIT_IGNORE_SIGNALS: Yes, that's OK.

Remove DRM_FENCE_FLAG_NO_USER: No. It is used by the Poulsbo X server EXA implementation and is quite valuable for small composite operations.

Remove DRM_BO_FLAG_CACHED_MAPPED and make that the default behaviour: No!!! We can't do that!!!

DRM_BO_FLAG_CACHED_MAPPED creates an invalid physical page aliasing, the details of which are thoroughly explained here:

http://marc.info/?l=linux-kernel&m=102376926732464&w=2

This aliasing is what led to the change_page_attr() and the dreaded global_flush_tlb() kernel calls. From what I understand it might be OK for streaming writes to the GPU (like batch buffers), but how would you stop a CPU from prefetching invalid data from a buffer while you're writing to it from the GPU? And even write it back, overwriting what the GPU just wrote? This would break anything trying to use TTM in a consistent way.

/Thomas
From: Thomas H. <th...@tu...> - 2008-02-29 12:21:54
Jerome Glisse wrote:
> On Fri, 29 Feb 2008 11:08:53 +0100 Thomas Hellström <th...@tu...> wrote:
>> 2) Relocation application. KeithP's presumed_offset stuff has to a great extent fixed this problem. I think the kmap_atomic_prot_pfn() stuff just added will take care of the rest, and I hope the mm kernel guys will understand the problem and accept kmap_atomic_prot_pfn(). I'm working on a patch that will do post-validation-only relocations this way.
>
> The kmap_atomic_prot_pfn change broke drm, as FIX_KMAP is defined only for 32-bit x86. Reported on irc.
>
> Cheers,
> Jerome Glisse <gl...@fr...>

OK. I'll fix.

/Thomas
From: seventh g. <sev...@gm...> - 2008-02-29 15:17:53
On Fri, Feb 29, 2008 at 12:21 PM, Thomas Hellström <th...@tu...> wrote:
> Jerome Glisse wrote:
>> The kmap_atomic_prot_pfn change broke drm, as FIX_KMAP is defined only for 32-bit x86. Reported on irc.
>
> OK. I'll fix.

It is fixed now, thanks! I posted this on the wrong topic a moment ago, sorry about that.

Cheers,
Renato Caldas
From: Eric A. <er...@an...> - 2008-02-28 19:39:30
On Thu, 2008-02-28 at 10:08 +0100, Thomas Hellström wrote:
> Remove DRM_BO_HINT_WAIT_LAZY. No. This flag is intended for polling only hardware, and has no use at all in the intel driver once the sync flushes are gone. The fact that you ever saw a difference with this flag is that there was a bug in the execbuf code that caused you to hit a polling path in the fence wait mechanism.
>
> Ignore DRM_FENCE_FLAG_WAIT_LAZY. NO. Same as above.

OK. We should clarify this in the ioctl descriptions so that people with sane hardware know that the flags are ignored.

> Remove DRM_BO_FLAG_CACHED_MAPPED and make that a default behaviour. No!!! We can't do that!!! DRM_BO_FLAG_CACHED_MAPPED is creating an invalid physical page aliasing, the details of which are thoroughly explained here

I may have said it wrong: make DRM_BO_FLAG_CACHED_MAPPED the default behavior if the platform can support it. The point is that it should not be a userland interface -- if the kernel can manage it, then just do it. Otherwise, don't. I'd rather see us disable the performance hack for now than leave a go-faster switch in the interface.

Going back over the commit, I didn't make the better behavior conditional on the platform being able to do it. Oops, I need to fix that.

> http://marc.info/?l=linux-kernel&m=102376926732464&w=2
>
> And this resulted in the change_page_attr() and the dreaded global_flush_tlb() kernel calls. From what I understand it might be OK for streaming writes to the GPU (like batch-buffers) but how would you stop a CPU from prefetching invalid data from a buffer while you're writing to it from the GPU? And even write it back, overwriting what the GPU just wrote? This would break anything trying to use TTM in a consistent way.

As far as we know, Intel CPUs are not affected by the AMD limitation that read-only speculation may result in later writeback, so what we do works out. It does look like we're not flushing the CPU cache at map time (bo_map_ioctl -> buffer_object_map -> bo_wait, bo_evict_cached -> bo_evict -> move_mem), which is wrong.

Note that in the current implementation, when we map the buffer again, we unmap it out of the hardware. It would also be nice to not unmap it from the hardware and leave the GART mapping as-is, and just flush the cache again when validating. The 3D driver basically never hits this path at the moment, but the X server certainly would (sadly), and we may have the 3D driver doing this if we do userland buffer reuse.

--
Eric Anholt
anholt@FreeBSD.org er...@an... eri...@in...
From: Thomas H. <th...@tu...> - 2008-02-29 10:09:17
Eric Anholt wrote:
> OK. We should clarify this in the ioctl descriptions so that people with sane hardware know that the flags are ignored.

Indeed. The lack of documentation is disturbing and should be fixed asap.

> I may have said it wrong: Make DRM_BO_FLAG_CACHED_MAPPED the default behavior if the platform can support it. The point is that it should not be userland interface -- if the kernel can manage it, then just do it. Otherwise, don't. I'd rather see us disable the performance hack for now than leave a go-faster switch in the interface.
>
> Going back over the commit, I didn't make the better behavior conditional on the platform being able to do it. Oops, I need to fix that.

Yes. Hmm, as I see it there are three performance problems that DRM_BO_FLAG_CACHED_MAPPED attempts to address:

1) The buffer creation latency due to global_flush_tlb(). This can be worked around with buffer/page caching in a number of ways (below), and once the wbinvd() is gone from the main kernel it won't be such a huge problem anymore.

a) A kernel pool of uncached / unmapped (highmem-like) pages. (Not likely to occur anytime soon.)
b) A pre-bound region of VRAM-like AGP memory for batch buffers and friends. Easy to set up and avoids flushing issues altogether.
c) User-space bo caching and reuse.
d) User-space buffer pools.

TG is heading down the d) path since it also fixes the texture granularity problem.

2) Relocation application. KeithP's presumed_offset stuff has to a great extent fixed this problem. I think the kmap_atomic_prot_pfn() stuff just added will take care of the rest, and I hope the mm kernel guys will understand the problem and accept kmap_atomic_prot_pfn(). I'm working on a patch that will do post-validation-only relocations this way.

3) Streaming reads from GPU to CPU. Use cache-coherent buffers if available, otherwise SGDMA. I'm not sure (due to prefetching) that DRM_BO_FLAG_CACHED_MAPPED addresses this issue correctly.

So from my perspective I'd like to keep the default behavior, particularly as we're using d) to address problem 1), and if I understand it correctly, Intel is heading down c). In the long run I'd like to see DRM_BO_FLAG_CACHED_MAPPED disappear, and us fix whatever's in the way for you to implement c). If we need to address this before kernel inclusion, is there a way we can have that as a driver-specific flag? That would mean adding a driver-specific flag preprocessing callback.

> As far as we know, Intel CPUs are not affected by the AMD limitation that read-only speculation may result in later writeback, so what we do works out. It does look like we're not flushing CPU cache at map time (bo_map_ioctl -> buffer_object_map -> bo_wait, bo_evict_cached -> bo_evict -> move_mem), which is wrong.
>
> Note that in the current implementation, when we map the buffer again, we unmap it out of the hardware. It would also be nice to not unmap it from the hardware and leave the GART mapping as-is, and just flush the cache again when validating. The 3D driver basically never hits this path at the moment, but the X server certainly would (sadly), and we may have the 3D driver doing this if we do userland buffer reuse.

Yes, leaving the GART mapping as-is should probably work fine. My concern is a case similar to where you're doing rendering and then need to do a software fallback. You'll map the destination buffer but have no way of knowing whether the CPU has already speculatively prefetched invalid data into the cached kernel mapping. I guess, in that case, it'll be propagated into the user-space mapping as well?

/Thomas

> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2008.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> --
> _______________________________________________
> Dri-devel mailing list
> Dri...@li...
> https://lists.sourceforge.net/lists/listinfo/dri-devel
From: Jerome G. <gl...@fr...> - 2008-02-29 12:19:34
On Fri, 29 Feb 2008 11:08:53 +0100 Thomas Hellström <th...@tu...> wrote:
> 2) Relocation application. KeithP's presumed_offset stuff has to a great extent fixed this problem. I think the kmap_atomic_prot_pfn() stuff just added will take care of the rest, and I hope the mm kernel guys will understand the problem and accept kmap_atomic_prot_pfn(). I'm working on a patch that will do post-validation-only relocations this way.

The kmap_atomic_prot_pfn change broke drm, as FIX_KMAP is defined only for 32-bit x86. Reported on irc.

Cheers,
Jerome Glisse <gl...@fr...>
From: Eric A. <er...@an...> - 2008-03-04 20:38:09
On Fri, 2008-02-29 at 11:08 +0100, Thomas Hellström wrote:
> Yes, hmm, as I see it there are three performance problems that DRM_BO_FLAG_CACHED_MAPPED attempts to address:
>
> 1) The buffer creation latency due to global_flush_tlb(). This can be worked around with buffer/page caching in a number of ways (below) and once the wbinvd() is gone from the main kernel it won't be such a huge problem anymore.
> a) kernel pool of uncached / unmapped (highmem-like) pages. (Not likely to occur anytime soon)
> b) A pre-bound region of VRAM-like AGP memory for batch-buffers and friends. Easy to set up and avoids flushing issues altogether.
> c) User-space bo-caching and reuse.
> d) User-space buffer pools.
>
> TG is heading down the d) path since it also fixes the texture granularity problem.

There's no texture granularity problem on Intel, right, given that we have a fixed mipmap layout? Or have you seen real applications with piles of very small textures to the point where it's an issue? I'm concerned over our code growing in complexity to handle issues that aren't actually issues for our hardware.

c) is a very attractive solution as it's a 100-line diff to what we have today (people.freedesktop.org:~anholt/drm intel-buffer-reuse). Looks like something broke the 20% performance win I saw when I last tested this a few weeks ago.

But I'd rather just fix the performance bugs we have in the current DRM implementation so that we potentially don't even need to bother with these hacks. So: !highmem, page-by-page allocation.

> 2) Relocation application. KeithP's presumed_offset stuff has to a great extent fixed this problem. I think the kmap_atomic_prot_pfn() stuff just added will take care of the rest, and I hope the mm kernel guys will understand the problem and accept kmap_atomic_prot_pfn(). I'm working on a patch that will do post-validation-only relocations this way.

Right now kernel relocation handling is up to 9% of the profile. Unless some really weird inlining is going on, the map calls aren't showing up, which isn't surprising since we never actually do any mapping today (all of our memory is direct-mapped lowmem; I really want to fix that).

> 3) Streaming reads from GPU to CPU. Use cache-coherent buffers if available, otherwise SGDMA. I'm not sure (due to prefetching) that DRM_BO_FLAG_CACHED_MAPPED addresses this issue correctly.
> So from my perspective I'd like to keep the default behavior, particularly as we're using d) to address problem 1), and if I understand it correctly, Intel is heading down c).
>
> In the long run I'd like to see DRM_BO_FLAG_CACHED_MAPPED disappear, and us fix whatever's in the way for you to implement c). If we need to address this before kernel inclusion, is there a way we can have that as a driver-specific flag? That would mean adding a driver-specific flag preprocessing callback.

Going from CACHED_MAPPED back to uncached even with buffer reuse is a 22% performance drop (but at least it's not the 79% hit to the face you get without buffer reuse). So, I want DRM_BO_FLAG_CACHED_MAPPED to be the default for our DRM when you don't ask for cached buffers, determined by the driver flagging it as doable on this chipset.

> My concern is a case similar to where you're doing rendering and then need to do a software fallback. You'll map the destination buffer but have no way of knowing whether the CPU has already speculatively prefetched invalid data into the cached kernel mapping. I guess, in that case, it'll be propagated into the user-space mapping as well?

I apparently deleted that part of the "Note that" paragraph above in editing. Yes, we need to also be flushing the CPU cache when we map the buffer after acceleration. But when we hit that case it should no more than double the cost of the clflushing we do already when submitting buffers, which is about 2% of the profile at the moment (no additional externalized costs, as you shouldn't have any valid CPU caching of that buffer's contents anyway).

--
Eric Anholt
anholt@FreeBSD.org er...@an... eri...@in...
From: Thomas H. <th...@tu...> - 2008-03-04 23:25:43
Eric Anholt wrote:
> There's no texture granularity problem on Intel, right, given that we have a fixed mipmap layout? Or have you seen real applications with piles of very small textures to the point where it's an issue? I'm concerned over our code growing in complexity to handle issues that aren't actually issues for our hardware.

Yes, we've seen such cases, but they've been worked around in the app.

> c) is a very attractive solution as it's a 100 line diff to what we have today (people.freedesktop.org:~anholt/drm intel-buffer-reuse). Looks like something broke the 20% performance win I saw when I last tested this a few weeks ago.
>
> But I'd rather just fix the performance bugs we have in the current DRM implementation so that we potentially don't even need to bother with these hacks. So: !highmem, page-by-page allocation.
>
> Right now kernel relocation handling is up to 9% of the profile. Unless some really weird inlining is going on, the map calls aren't showing up, which isn't surprising since we never actually do any mapping today (all of our memory is direct-mapped lowmem; I really want to fix that).

Yes, you're probably always using the fixed kernel map. The kmap_atomic_prot_pfn() is important for post relocations that are done through the GTT, where the ioremap() otherwise would be nasty.

> Going from CACHED_MAPPED back to uncached even with buffer reuse is a 22% performance drop (but at least it's not the 79% hit to the face you get without buffer reuse).

Hmm, this sounds odd. That simply means you must still be doing a lot of buffer binding / unbinding? Relocations using master drm will also be slow, of course.

> So, I want DRM_BO_FLAG_CACHED_MAPPED to be the default for our DRM when you don't ask for cached buffers, determined by the driver flagging it as doable on this chipset.

I see two problems with this. One is cosmetic, in that DRM_BO_FLAG_CACHED_MAPPED doesn't have the same semantics as !DRM_BO_FLAG_CACHED: you can't use it for user-space buffer pools. The other is that TG will be needing the functionality of !DRM_BO_FLAG_CACHED, so we need to provide a way to have that. I guess you will be needing it as well for things like scanout buffers?

/Thomas
From: Thomas H. <th...@tu...> - 2008-03-05 10:26:04
Thomas Hellström wrote:
> Eric Anholt wrote:
>> Going from CACHED_MAPPED back to uncached even with buffer reuse is a 22% performance drop (but at least it's not the 79% hit to the face you get without buffer reuse).
>
> Hmm, this sounds odd. That simply means you must still be doing a lot of buffer binding / unbinding? Relocations using master drm will also be slow of course.

Oh, and of course one must make sure that display-list VBOs don't end up in WC memory if we're doing any software vertex processing. Otherwise apps like gears will read vertex data from WC memory, which will kill performance. That means adjusting intel_buffer_objects.c to make sure VBOs end up CACHED.

>> So, I want DRM_BO_FLAG_CACHED_MAPPED to be the default for our DRM when you don't ask for cached buffers, determined by the driver flagging it as doable on this chipset.
>
> I see two problems with this: One is cosmetic in that DRM_BO_FLAG_CACHED_MAPPED doesn't have the same semantics as !DRM_BO_FLAG_CACHED. You can't use it for user-space buffer pools.

... And thinking more about this, you can't really use it for scanout buffers either, unless you want to clflush() twice and chipset-flush for each write operation.

/Thomas
From: Eric A. <er...@an...> - 2008-03-05 18:15:08
On Wed, 2008-03-05 at 01:25 +0100, Thomas Hellström wrote:
> Hmm, this sounds odd. That simply means you must still be doing a lot of buffer binding / unbinding? Relocations using master drm will also be slow of course.

This was just openarena off of master or intel-bo-reuse, with a change to object creation to strip CACHED|CACHED_MAPPED off. The rendering path we see is generally no relocations occurring on buffers that are currently in the GTT, and just the batch, curbe, and vertex buffers being new per batchbuffer. I should actually collect numbers to see how often we're relocating into things bound to the GTT.

> I see two problems with this: One is cosmetic in that DRM_BO_FLAG_CACHED_MAPPED doesn't have the same semantics as !DRM_BO_FLAG_CACHED. You can't use it for user-space buffer pools. The other one is that TG will be needing the functionality of !DRM_BO_FLAG_CACHED, so we need to provide a way to have that. I guess you will be needing it as well for things like scanout buffers?

It would be fine for scanout buffers if we fixed the fact that we evict the object as a signal that clflushing needs to occur. Then you're just down to two clflushes and a chipset flush per map/unmap of that buffer. The clflushing would be somewhat expensive, but may still beat uncached access through the GTT.

--
Eric Anholt
anholt@FreeBSD.org er...@an... eri...@in...
From: Thomas H. <th...@tu...> - 2008-03-05 18:56:31
Eric Anholt wrote:
>> I see two problems with this: One is cosmetic in that DRM_BO_FLAG_CACHED_MAPPED doesn't have the same semantics as !DRM_BO_FLAG_CACHED. You can't use it for user-space buffer pools. The other one is that TG will be needing the functionality of !DRM_BO_FLAG_CACHED, so we need to provide a way to have that. I guess you will be needing it as well for things like scanout buffers?
>
> It would be fine for scanout buffers if we fixed the fact that we evict the object as a signal that clflushing needs to occur. Then you're just down to two clflushes and a chipset flush per map/unmap of that buffer. The clflushing would be somewhat expensive, but may still beat uncached access through the GTT.

Eric,

What's extremely important to us is to keep the !DRM_BO_FLAG_CACHED functionality in the i915 drm for upcoming Gallium drivers. We see no performance issues with the way transient buffers were handled in the i915tex driver and are planning to reuse that functionality. I realize I'm not going to be able to convince you to do the same, but anyway I'd prefer that !DRM_BO_FLAG_CACHED have the same semantics as in other drivers and for other memory types. If not, we must be able to have a way to select them.

/Thomas