From: Keith P. <ke...@ke...> - 2008-03-01 17:56:11
While buffer objects provide a huge benefit for long-term objects, their benefit for transient data is less clear. Page allocation, cache flushing and mmap costs are all significant. This missive is designed to outline some of the potential mechanisms to ameliorate or transfer these costs to other areas. Please treat it as a suggestion for a list of experiments to try, not as a directive for how things should be done.

Let's focus on the batch buffer, as that is a fairly universal object in this realm. In the Intel driver, batch buffers are handled as follows:

 1. Allocate a buffer object
 2. Map the buffer object to user space
 3. Write data to the object, faulting in new pages
 4. Pass the object to the super-ioctl, allocating the remaining pages
 5. Clflush the memory range of the object
 6. Flush the chipset
 7. Map the object to the GTT
 8. When the fence passes, free the object

Profiling the Intel driver has shown us that several of these steps are expensive:

 A. Allocating pages one at a time
 B. Mapping pages to user space
 C. Writing to un-cached buffer object pages
 D. Flushing CPU caches

Two kernel developers suggested to me this week that it would likely be cheaper to copy data in the kernel than to mmap the object and write the data from user space. Re-using the same buffer in user space would mean that the stores would hit cached memory, rather than pulling the entire batch buffer into the cache on every batch. Using non-temporal stores for the copy would mean that the destination buffer isn't loaded into cache, only to be immediately flushed back out again with clflush.

They also said that allocating pages one at a time is very inefficient, and that we should instead ask for as many pages as we need up front, falling back to smaller allocations when the larger allocation is not available. The kernel groups pages in power-of-two buckets; doing our allocations atomically would also tend to reduce memory fragmentation within the kernel.
This will not increase memory usage, as we already allocate all of the pages eventually -- there is no benefit to allocating them slowly.

Eric has already experimented with re-using buffer objects from the driver. Buffers were allocated in power-of-two size buckets; when a buffer was freed, it was placed on a list of same-sized buffers. At allocation time, the driver would check the first element of the appropriate list to see if it was idle. If so, that buffer would be re-used; otherwise, a new buffer would be allocated. The results were impressive -- greater than a 20% performance improvement in openarena. However, this pins a huge amount of memory, and it still doesn't avoid the cache effects of all the flushing.

So, here's a list of things I think we should try:

 1. Allocate all BO pages at create time.

    This will eliminate the page-fault overhead, clump DRM pages together (avoiding memory fragmentation) and reduce the cost of allocation. The 965 driver currently spends 20-30% of its CPU time allocating pages.

 2. Add an ioctl for CopyBOSubData.

    Create a buffer in user space to hold batch-buffer contents that is re-used for each new batch. When the batch is full, copy it to the buffer object with this kernel call, using non-temporal stores to avoid bringing the destination buffer into cache.

 3. Use the GTT for CopyBOSubData.

    I wonder if doing the CopyBOSubData through the GTT would be more efficient. It would eliminate the need to flush the chipset, and would also avoid any question of whether non-temporal stores flush data all the way to the chipset. This would change buffer management to:

    1. Allocate a buffer object and all of its pages
    2. Write data to a user-space buffer
    3. Map the object to the GTT
    4. Copy the data to the buffer object
    5. Pass the object to the super-ioctl
    6. When the fence passes, free the object

-- 
kei...@in...