From: Keith P. <ke...@ke...> - 2008-03-01 17:56:11
While buffer objects provide a huge benefit for long-term objects, their benefit for transient data is less clear. Page allocation, cache flushing and mmap costs are all significant. This missive is designed to outline some of the potential mechanisms to ameliorate or transfer these costs to other areas. Please treat it as a suggestion for a list of experiments to try, not as a directive for how things should be done.

Let's focus on the batch buffer, as that is a fairly universal object in this realm. In the Intel driver, batch buffers are handled as follows:

 1. Allocate a buffer object
 2. Map the buffer object to user space
 3. Write data to the object, faulting in new pages
 4. Pass the object to the super-ioctl, allocating the remaining pages
 5. Clflush the memory range of the object
 6. Flush the chipset
 7. Map the object to the GTT
 8. When the fence passes, free the object

Profiling the Intel driver has shown us that several of these steps are expensive:

 A. Allocating pages one at a time
 B. Mapping pages to user space
 C. Writing to un-cached buffer object pages
 D. Flushing CPU caches

Two kernel developers suggested to me this week that it would likely be cheaper to copy data in the kernel than to mmap the object and write the data from user space. Re-using the same buffer in user space would mean that the stores would hit cached memory, rather than pulling the entire batch buffer into the cache on every batch. Using non-temporal stores for the copy would mean that the destination buffer isn't loaded into cache, only to be immediately flushed back out again with clflush.

They also said that allocating pages one at a time is very inefficient, and that we should instead ask for as many pages as we need up front, falling back to smaller allocations when the larger allocation is not available. The kernel groups pages in power-of-two buckets; doing our allocations atomically would also tend to reduce memory fragmentation within the kernel.
This will not increase memory usage, as we already allocate all of the pages eventually -- there is no benefit to allocating them slowly.

Eric has already experimented with re-using buffer objects from the driver. Buffers were allocated in power-of-two size buckets; when a buffer was freed, it was placed on a list of same-sized buffers. At allocation time, the driver would check the first element of the appropriate list to see if it was idle. If so, that buffer would be re-used; otherwise, a new buffer would be allocated. The results were impressive -- greater than a 20% performance improvement in openarena. However, this pins a huge amount of memory, and it still doesn't avoid the cache effects of all the flushing.

So, here's a list of things I think we should try:

 1. Allocate all BO pages at create time.

    This will eliminate the page-fault overhead, clump DRM pages together (avoiding memory fragmentation) and reduce the cost of allocation. The 965 driver currently spends 20-30% of its CPU time allocating pages.

 2. Add an ioctl for CopyBOSubData.

    Create a buffer in user space to hold batch-buffer contents that is re-used for each new batch. When the batch is full, copy it to the buffer object with this kernel call, using non-temporal stores to avoid bringing the destination buffer into cache.

 3. Use the GTT for CopyBOSubData.

    I wonder if doing the CopyBOSubData through the GTT would be more efficient. It would eliminate the need to flush the chipset, and would also avoid any question of whether non-temporal stores flush data all the way to the chipset. This would change buffer management to:

    1. Allocate a buffer object and all of its pages
    2. Write data to a user-space buffer
    3. Map the object to the GTT
    4. Copy the data to the buffer object
    5. Pass the object to the super-ioctl
    6. When the fence passes, free the object

-- 
kei...@in...