On Fri, 2008-02-29 at 11:08 +0100, Thomas Hellström wrote:
> Eric Anholt wrote:
> > On Thu, 2008-02-28 at 10:08 +0100, Thomas Hellström wrote:
> >> Eric Anholt wrote:
> >>> On Thu, 2008-02-28 at 06:08 +1000, Dave Airlie wrote:
> >> Remove unused DRM_FENCE_FLAG_WAIT_IGNORE_SIGNALS. Yes that's OK.
> >> Remove DRM_FENCE_FLAG_NO_USER No. Used by the Poulsbo X server EXA
> >> implementation and is quite valuable for small composite operations.
> >> Remove DRM_BO_FLAG_CACHED_MAPPED and make that a default behaviour.
> >> No!!! We can't do that!!!
> >> DRM_BO_FLAG_CACHED_MAPPED creates an invalid physical page aliasing,
> >> the details of which are thoroughly explained here
> > I may have said it wrong: Make DRM_BO_FLAG_CACHED_MAPPED the default
> > behavior if the platform can support it. The point is that it should
> > not be userland interface -- if the kernel can manage it, then just do
> > it. Otherwise, don't. I'd rather see us disable the performance hack
> > for now than leave a go-faster switch in the interface.
> > Going back over the commit, I didn't make the better behavior
> > conditional on the platform being able to do it. Oops, I need to fix
> > that.
> Yes, hmm, as I see it there are three performance problems that
> DRM_BO_FLAG_CACHED_MAPPED attempts to address:
> 1) The buffer creation latency due to global_flush_tlb(). This can be
> worked around with buffer /page caching in a number of ways (below) and
> once the wbinvd() is gone from the main kernel it won't be such a huge
> problem anymore.
> a) kernel pool of uncached / unmapped (highmem-like) pages. (Not likely
> to occur anytime soon)
> b) A pre-bound region of VRAM-like AGP memory for batch-buffers and
> friends. Easy to set up and avoids flushing issues altogether.
> c) User-space bo-caching and reuse.
> d) User-space buffer pools.
> TG is heading down the d) path since it also fixes the texture
> granularity problem.
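The sub-allocating pool in d) could be sketched roughly like this (all names, sizes, and the bitmask scheme are illustrative, not TG's implementation; plain malloc stands in for the mapped BO):

```c
/* Hypothetical sketch of option d): sub-allocate small buffers out of one
 * large, pre-allocated DRM buffer object, so tiny allocations don't each
 * pay the uncached-allocation (change_page_attr/global_flush_tlb) cost. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_BYTES (1 << 20)   /* one pre-allocated 1 MiB BO */
#define SLOT_SIZE  (4 << 10)   /* fixed 4 KiB granularity for small buffers */
#define NSLOTS     (POOL_BYTES / SLOT_SIZE)

struct pool {
    uint8_t *base;             /* CPU mapping of the large BO */
    uint32_t free_mask[NSLOTS / 32];
};

static struct pool *pool_create(void)
{
    struct pool *p = calloc(1, sizeof(*p));
    p->base = malloc(POOL_BYTES);   /* stand-in for the mapped BO */
    for (int i = 0; i < NSLOTS / 32; i++)
        p->free_mask[i] = ~0u;      /* all slots start free */
    return p;
}

/* Hand out one slot; the offset within the BO is what relocations use. */
static long pool_alloc(struct pool *p)
{
    for (int i = 0; i < NSLOTS / 32; i++) {
        if (p->free_mask[i]) {
            int bit = __builtin_ctz(p->free_mask[i]);
            p->free_mask[i] &= ~(1u << bit);
            return (long)(i * 32 + bit) * SLOT_SIZE;
        }
    }
    return -1;                      /* pool exhausted */
}

static void pool_free(struct pool *p, long offset)
{
    int slot = offset / SLOT_SIZE;
    p->free_mask[slot / 32] |= 1u << (slot % 32);
}
```

Allocation and free are then pure userspace bookkeeping with no kernel round-trip at all.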
There's no texture granularity problem on Intel, right, given that we
have a fixed mipmap layout? Or have you seen real applications with
piles of very small textures to the point where it's an issue? I'm
concerned over our code growing in complexity to handle issues that
aren't actually issues for our hardware.
c) is a very attractive solution as it's a 100 line diff to what we have
today (people.freedesktop.org:~anholt/drm intel-buffer-reuse). Looks
like something broke the 20% performance win I saw when I last tested
this a few weeks ago.
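Roughly, option c) amounts to something like the following (the power-of-two bucketing and all names here are assumptions for illustration, not the intel-buffer-reuse code; malloc stands in for the expensive kernel allocation):

```c
/* Sketch of option c), user-space bo caching/reuse: instead of freeing a
 * buffer back to the kernel (and paying change_page_attr()/
 * global_flush_tlb() again on the next allocation), keep it on a per-size
 * free list and hand it back out on the next matching request. */
#include <assert.h>
#include <stdlib.h>

#define NBUCKETS 12              /* buckets for sizes 4 KiB .. 8 MiB */

struct cached_bo {
    size_t size;
    void *ptr;                   /* stand-in for the real BO handle */
    struct cached_bo *next;
};

static struct cached_bo *bucket[NBUCKETS];

static int bucket_for(size_t size)
{
    size_t s = 4096;
    for (int i = 0; i < NBUCKETS; i++, s <<= 1)
        if (size <= s)
            return i;
    return -1;                   /* too big to bother caching */
}

/* Allocate: reuse a cached buffer from the right bucket when one exists. */
static struct cached_bo *bo_alloc(size_t size)
{
    int b = bucket_for(size);
    if (b >= 0 && bucket[b]) {
        struct cached_bo *bo = bucket[b];
        bucket[b] = bo->next;    /* cache hit: no kernel round-trip */
        return bo;
    }
    struct cached_bo *bo = malloc(sizeof(*bo));
    bo->size = (size_t)4096 << (b < 0 ? 0 : b);
    bo->ptr = malloc(bo->size);  /* stand-in for the expensive kernel alloc */
    bo->next = NULL;
    return bo;
}

/* Free: push onto the bucket instead of destroying. */
static void bo_free(struct cached_bo *bo)
{
    int b = bucket_for(bo->size);
    bo->next = bucket[b];
    bucket[b] = bo;
}
```

Rounding requests up to bucket sizes is what makes reuse likely enough to pay off.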
But I'd rather just fix the performance bugs we have in the current
DRM implementation so that we potentially don't even need to bother with
these hacks. So, !highmem, page-by-page allocation
> 2) Relocation application. KeithP's presumed_offset stuff has to a great
> extent fixed this problem. I think the kmap_atomic_prot_pfn() stuff just
> added will take care of the rest, and I hope the mm kernel guys will
> understand the problem and accept the kmap_atomic_prot_pfn() in. I'm
> working on a patch that will do post-validation only relocations this way.
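The presumed_offset scheme boils down to something like this (field names are illustrative, not the exact drm structures): userspace records where it presumes each target buffer will be bound, and the kernel only touches the batch when validation placed the buffer somewhere else.

```c
/* Hedged sketch of presumed_offset relocation handling: skip the CPU write
 * (and the mapping it would require) whenever userspace guessed the target
 * buffer's final offset correctly. */
#include <assert.h>
#include <stdint.h>

struct reloc {
    uint32_t batch_dword;      /* index into the batch buffer to patch */
    uint32_t presumed_offset;  /* where userspace thought the target was */
    uint32_t delta;            /* offset within the target buffer */
};

/* Returns the number of dwords actually rewritten. */
static int apply_relocs(uint32_t *batch, const struct reloc *r, int n,
                        uint32_t actual_offset)
{
    int patched = 0;
    for (int i = 0; i < n; i++) {
        if (r[i].presumed_offset == actual_offset)
            continue;          /* userspace guessed right: no CPU write */
        batch[r[i].batch_dword] = actual_offset + r[i].delta;
        patched++;
    }
    return patched;
}
```

In the common case where buffers stay put between submissions, the loop writes nothing, which is why it largely fixed problem 2).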
Right now kernel relocation handling is up to 9% of the profile. Unless
some really weird inlining is going on, the map calls aren't showing up,
which isn't surprising since we never actually do any mapping today (all
of our memory is direct-mapped lowmem; I really want to fix that).
> 3) Streaming reads from GPU to CPU. Use cache-coherent buffers if
> available, otherwise SGDMA. I'm not sure (due to prefetching) that
> DRM_BO_FLAG_CACHED_MAPPED addresses this issue correctly.
> So from my perspective I'd like to keep the default behavior,
> particularly as we're using d) to address problem 1), and if I
> understand it correctly, Intel is heading down c).
> In the long run I'd like to see DRM_BO_FLAG_CACHED_MAPPED disappear, and
> us fix whatever's in the way for you to implement c). If we need to
> address this before a kernel inclusion, is there a way we can have that
> as a driver-specific flag? That would mean adding a driver-specific flag
> preprocessing callback.
Going from CACHED_MAPPED back to uncached even with buffer reuse is a
22% performance drop (but at least it's not the 79% hit to the face you
get without buffer reuse).
So, I want DRM_BO_FLAG_CACHED_MAPPED to be the default for our DRM when
you don't ask for cached buffers, determined by the driver flagging it
as doable on this chipset.
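Something like the following is all the decision would take (flag values and the driver-caps field are made up here for illustration; the point is only that the choice lives in the kernel, not the uAPI):

```c
/* Sketch of making CACHED_MAPPED a driver decision instead of a userland
 * flag: if the chipset driver marks it safe and the user did not ask for
 * fully cached (coherent) buffers, the kernel applies it by default. */
#include <assert.h>

#define DRM_BO_FLAG_CACHED        (1u << 0)
#define DRM_BO_FLAG_CACHED_MAPPED (1u << 1)  /* internal here, not uAPI */
#define DRIVER_CAP_CACHED_MAPPED  (1u << 0)

static unsigned int bo_effective_flags(unsigned int user_flags,
                                       unsigned int driver_caps)
{
    if (!(user_flags & DRM_BO_FLAG_CACHED) &&
        (driver_caps & DRIVER_CAP_CACHED_MAPPED))
        user_flags |= DRM_BO_FLAG_CACHED_MAPPED;
    return user_flags;
}
```

A chipset that can't support the aliasing simply never sets the capability bit and gets today's uncached behavior.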
> >> http://marc.info/?l=linux-kernel&m=102376926732464&w=2
> >> And this resulted in the change_page_attr() and the dreaded
> >> global_flush_tlb() kernel calls. From what I understand it might be OK
> >> for streaming writes to the GPU (like batch-buffers) but how would you
> >> stop a CPU from prefetching invalid data from a buffer while you're
> >> writing to it from the GPU? And even write it back, overwriting what the
> >> GPU just wrote?
> >> This would break anything trying to use TTM in a consistent way.
> > As far as we know, Intel CPUs are not affected by the AMD limitation
> > that read-only speculation may result in later writeback, so what we do
> > works out. It does look like we're not flushing CPU cache at map time
> > (bo_map_ioctl -> buffer_object_map -> bo_wait, bo_evict_cached ->
> > bo_evict -> move_mem), which is wrong.
> > Note that in the current implementation, when we map the buffer again,
> > we unmap it out of the hardware. It would also be nice to not unmap it
> > from the hardware and leave the GART mapping as-is, and just flush the
> > cache again when validating. The 3D driver basically never hits this
> > path at the moment, but the X server certainly would (sadly), and we may
> > have the 3D driver doing this if we do userland buffer reuse.
> Yes, leaving the GART mapping as-is should probably work fine.
> My concern is a case similar to the one where you're doing rendering
> and then need to do a software fallback.
> You'll map the destination buffer but have no way of knowing whether the
> CPU has already speculatively prefetched invalid data into the cached
> kernel mapping. I guess, in that case, it'll be propagated into the
> user-space mapping as well?
I apparently deleted that part of the "Note that" paragraph above in
editing. Yes, we need to also be flushing the CPU cache when we map the
buffer after acceleration. But when we hit that case it should do no
more than double the cost of the clflushing we do already when submitting
buffers, which is about 2% of the profile at the moment (no additional
externalized costs as you shouldn't have any valid CPU caching of that
buffer's contents anyway).
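The "no more than double" claim is just cacheline arithmetic: the map-time flush walks the same lines the submit-time clflush loop already walks, so the worst case is a second pass over the buffer (the 64-byte line size below is the usual value on these CPUs):

```c
/* Back-of-envelope for the map-time flush cost: one clflush pass at
 * submit, at most one more when mapping the buffer after the GPU wrote
 * it, i.e. at most 2x the existing ~2%-of-profile cost. */
#include <assert.h>

#define CACHELINE 64

static unsigned long flush_ops(unsigned long buf_bytes, int flush_at_map)
{
    unsigned long lines = (buf_bytes + CACHELINE - 1) / CACHELINE;
    return flush_at_map ? 2 * lines : lines;
}
```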
Eric Anholt anholt@...