From: Ian R. <id...@us...> - 2003-01-17 01:33:53
|
What follows is the collected requirements for the new DRI memory manager. This list is the product of several discussions between Brian, Keith, Allen, and myself several months ago. After the list, I have included some of my thoughts on the big picture that I see from these requirements.

1. Single-copy textures

   Right now each texture exists in two or three places: a copy in on-card or AGP memory, a copy in system memory (managed by the driver), and a copy in application memory. Any solution should be able to eliminate one or two of those copies. If the driver-tracked copy in system memory is eliminated, care must be taken when the texture needs to be removed from on-card / AGP memory. Additionally, changes to the texture image made via glCopyTexImage must not be lost.

   It may be possible to eliminate one copy of the texture using APPLE_client_storage. A portion of this could be done purely in Mesa. If the user-supplied image matches the internal format of the texture, the driver can use the application's copy of the texture in place of the driver's copy. Modulo implementation difficulties, it may even be possible to use the pages that hold the texture as backing store for a portion of the AGP aperture. This is the only way to truly achieve single-copy textures. The implementation may prove too difficult on existing x86 systems to be worth the effort. This functionality is available in MacOS 10.1, so the same difficulties may not exist on Linux PPC.

2. Share texture memory among multiple OpenGL contexts

   Texture memory is currently shared by all OpenGL contexts. That is, when an OpenGL context switch happens it is not necessary to reload all textures. The texture manager needs to continue to use a paged memory model (as opposed to a segmented memory model).

3. Accommodate other OpenGL buffers

   The allocator should also be used for allocating vertex buffers, render targets (pbuffers, back-buffers, depth-buffers, etc.), and other buffers. This can be useful beyond supporting SGIX_pbuffer, ARB_vertex_array_objects, and optimized display lists. Dynamically allocating per-context depth and back-buffers will allow multiple Z depths to be used at a time (i.e., a 16-bit depth-buffer for one window and a 24-bit depth-buffer for another) as well as super-sampling FSAA.

4. Support texture pseudo-render targets

   Accelerating some OpenGL functions, such as glCopyTexImage, SGIS_generate_mipmaps, and ARB_render_texture, may require special support and consideration.

5. Additional AGP related issues

   There may be cases where textures need to be moved back and forth between AGP and on-card memory. For example, a texture might reside in AGP memory, and an operation may be requested that requires the texture to be in on-card memory.

6. Additional texture formats and layouts

   Compressed, 1D, 3D, cube map, and non-power-of-two textures need to be supported in addition to "traditional" 2D power-of-two textures.

7. Allen Akin's pinned-texture proposal

   If we ever expose memory management to the user (beyond texture priorities), we want to be sure our allocator is designed with this in mind.

8. Device independence

   As much as possible, the source code for the memory manager should live somewhere device independent. This is both for the benefit of newly developed drivers and for maintaining existing drivers.

* My Thoughts *

There are really only two radical departures from the existing memory manager. The first is using the memory manager for non-texture memory objects. The second, which is partially a result of the first, is the need to "pin" objects. It would not do to have one context kick another context's depth-buffer out of memory!

My initial thought on how to accomplish this was to move the allocator into the kernel. There would be a low-level allocator that could be used for non-texture buffers and a way to create textures (from data). In the texture case, the kernel would only allocate memory when a texture was used. Instead of using the actual texture address in drawing command streams, the user-level driver would insert texture IDs. The kernel would use these IDs to map to real texture addresses. The benefit is that all memory management would be handled by a single omniscient execution context (the kernel). The downside is that it would move a LOT of code into the kernel. It would be almost entirely OS and device independent, but there would likely be a lot of it.

After talking with Jeff Hartmann in IRC on 1/13, I started thinking about all of this again. Jeff had some serious reservations about moving that volume of code into the kernel, and he believed that all of the requirements could be met by a purely user-space implementation. After thinking about things some more, I'm starting to agree. What follows is a fairly random series of thoughts on how a user-space memory manager could be made to work.

I believe that everything could be done by breaking each memory space down into blocks (as is currently done) and tracking two values, either implicitly or explicitly, with each block. The first value is some sort of swap-out priority. This is currently implicitly tracked by the list ordering in the SAREA. The other value is basically a semaphore, but it could be implemented as a simple can-swap bit. Blocks that hold an active depth-buffer would never have can-swap set. Blocks that hold "normal" textures, back-buffers, render-target textures, and pbuffers would have their can-swap bit conditionally set. Each of these types of blocks would have the can-swap bit cleared under the following situations:

- Normal textures - While a rendering operation is queued that will use the texture.

- SGIS_generate_mipmaps textures - While the blits are in progress to create the filtered mipmaps.

- glCopyTexImage textures - While the blit to copy image data to the texture is in progress and while the data in the texture has not been copied to some sort of backing store.

- pbuffers - While rendering operations to the pbuffer are in progress. pbuffers have a mechanism to tell an application when the contents of the pbuffer have been "lost." This could be exploited by the memory manager. One caveat is when a pbuffer is bound to a texture (ARB_render_texture). While the pbuffer is bound to a texture, its contents cannot be lost. Can the contents be "swapped out" to some sort of backing store, like with glCopyTexImage targets?

- Back-buffers - In unextended GLX, back-buffers can never be swapped. However, if OML_sync_control is available, a "double buffered" visual may want to have many virtual back-buffers. Each time glXSwapBuffersMscOML (essentially an asynchronous glXSwapBuffers call) is made, a new back-buffer is allocated as the rendering target. Once a back-buffer is copied to the front-buffer (i.e., the queued buffer-swap completes), the back-buffer can be swapped out.

There may be other situations where can-swap is cleared, but that's all I could think of. Similar rules would exist for vertex buffers (for ARB_vertex_array_object, EXT_compiled_vertex_array, optimized display lists, etc.).

Only a single bit per block is needed in the SAREA. That bit is the union of the bits for each object that is part of that block. This union must be calculated by the user-space driver. This presents a possible problem of user-space clients failing to update the can-swap bits for some reason (process hung on a blocking IO call?). The current implementation avoids this problem by forcing all blocks to be swappable at all times.

At this point I'm left with a few questions.

1. In a scheme like this, how could processes be forced to update the can-swap bits on blocks that they own?

2. What is the best way for processes to be notified of events that could cause can-swap bits to change (i.e., rendering completion, asynchronous buffer-swap completion, etc.)? Signals from the kernel? Polling "age" variables?

3. If some sort of signal-based notification is used, could it be used to implement NV_fence and / or APPLE_fence?

4. How could the memory manager handle objects that span multiple blocks? In other words, could the memory manager be made to prefer to swap out blocks that wholly contain all of the objects that overlap the block? Are there other useful metrics? Prefer to swap out blocks that are half full over blocks that are completely full?

5. What other things have I missed that might prevent this system from working? :) |
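The two per-block values Ian describes (a swap-out priority plus a can-swap bit derived from every object overlapping the block) can be sketched in a few lines of C. Everything here — struct names, fields, the per-object flags — is a hypothetical illustration of the idea, not actual DRI code:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_OBJECTS_PER_BLOCK 8

struct mem_object {
    int in_flight;   /* a queued rendering op still uses this object */
    int is_pinned;   /* e.g. an active depth-buffer: never swappable */
};

struct mem_block {
    uint32_t swap_priority;   /* stands in for the SAREA list ordering */
    struct mem_object *objs[MAX_OBJECTS_PER_BLOCK];
    int nobjs;
};

/* The single SAREA bit per block is the combination of the per-object
 * states: the block is swappable only if every object that is part of
 * the block is itself swappable right now. */
static int block_can_swap(const struct mem_block *b)
{
    for (int i = 0; i < b->nobjs; i++)
        if (b->objs[i]->is_pinned || b->objs[i]->in_flight)
            return 0;
    return 1;
}
```

A block holding only an idle texture reports can-swap; adding a pinned depth-buffer to the same block clears it, which is exactly the "one context must not kick another context's depth-buffer out" property.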
From: magenta <ma...@tr...> - 2003-01-17 06:16:35
|
On Thu, Jan 16, 2003 at 05:33:42PM -0800, Ian Romanick wrote:
>
> 1. In a scheme like this, how could processes be forced to update the
>    can-swap bits on blocks that they own?

Should it even be possible for one process to swap out other processes' context data? Alternatively (forgive me if this sounds a bit naive), could the swapping be handled by agpgart, which just changes the memory mapping of the allocated pages in the background? Sort of an added VM layer, only it would swap to system memory (which could then be swapped to disk)...

> 2. What is the best way for processes to be notified of events that
>    could cause can-swap bits to change (i.e., rendering completion,
>    asynchronous buffer-swap completion, etc.)? Signals from the kernel?
>    Polling "age" variables?

I'd lean towards signals, myself, though then that leads to possible problems with libGL using a signal which an application wants to use... Or would it be capable of defining new signals? (I'm not up to speed on how that part of the kernel works. Would it just be as simple as adding a new value to an enumeration?)

> 4. How could the memory manager handle objects that span multiple
>    blocks? In other words, could the memory manager be made to prefer
>    to swap-out blocks that wholly contain all of the objects that
>    overlap the block? Are there other useful metrics? Prefer to
>    swap-out blocks that are half full over blocks that are completely
>    full?

If the AGP layer were to treat it like a VM layer and the page size were small (say, 4K), I don't think this would be an issue.

--
http://trikuare.cx |
From: Allen A. <ak...@po...> - 2003-01-17 07:14:51
|
On Thu, Jan 16, 2003 at 10:16:30PM -0800, magenta wrote:
|
| Should it even be possible for one process to swap out other processes'
| context data?

In the same way that one process can cause the ordinary memory pages of another process to be swapped out, I'd say "yes."

As the old saying goes, "Virtual memory is a technique that makes a lot of memory look like a lot of memory." :-) The same holds true for OpenGL context data (especially textures).

Allen |
From: magenta <ma...@tr...> - 2003-01-17 07:49:51
|
On Thu, Jan 16, 2003 at 11:03:21PM -0800, Allen Akin wrote:
> On Thu, Jan 16, 2003 at 10:16:30PM -0800, magenta wrote:
> |
> | Should it even be possible for one process to swap out other
> | processes' context data?
>
> In the same way that one process can cause the ordinary memory pages of
> another process to be swapped out, I'd say "yes."

I'd personally take the school of thought that if the user is running a game which takes up 60MB of texture memory and then tries to concurrently launch something which takes up another 60MB of texture memory, it's their own fault that the other thing can only get 20MB. :) But I do think that treating AGP as another layer of traditional memory-mapped VM in kernel-space (just having agpgart handle the memory mapping, perhaps alongside or on top of the kernel's VM) would be the most elegant solution.

Heh, I had another thought which seems perversely sick and wrong, yet oh so right: make video memory get treated as normal memory pages, and just migrate stuff into those pages when it's needed. Then when video memory isn't in use, the kernel could migrate other stuff into video RAM. Unified memory for all! :)

> As the old saying goes, "Virtual memory is a technique that makes a lot
> of memory look like a lot of memory." :-) The same holds true for
> OpenGL context data (especially textures).

I'll have to remember that one. :)

--
http://trikuare.cx |
From: Dieter <Die...@ha...> - 2003-01-17 16:32:22
|
On Friday, 17 January 2003 08:42, magenta wrote:
> On Thu, Jan 16, 2003 at 11:03:21PM -0800, Allen Akin wrote:
> > On Thu, Jan 16, 2003 at 10:16:30PM -0800, magenta wrote:
> > | Should it even be possible for one process to swap out other
> > | processes' context data?
> >
> > In the same way that one process can cause the ordinary memory pages
> > of another process to be swapped out, I'd say "yes."
[-]
> Heh, I had another thought which seems perversely sick and wrong, yet
> oh so right: make video memory get treated as normal memory pages, and
> just migrate stuff into those pages when it's needed. Then when video
> memory isn't in use, the kernel could migrate other stuff into video
> RAM. Unified memory for all! :)

Sorry, but I think this _is_ perversely sick and wrong ;-) Remember all that goes on in *BSD and Linux to squeeze the most performance out of "real" memory, e.g. memcpy and friends. Have a closer look here:

[CFT] faster athlon/duron memory copy implementation
http://marc.theaimsgroup.com/?l=linux-kernel&m=103548024914815&w=2

So if you have too little memory, buy some modules.

Regards, Dieter |
From: Allen A. <ak...@po...> - 2003-01-17 19:26:07
|
On Thu, Jan 16, 2003 at 11:42:31PM -0800, magenta wrote:
| I'd personally take the school of thought that if the user is running a
| game which takes up 60MB of texture memory and then tries to
| concurrently launch something which takes up another 60MB of texture
| memory, it's their own fault that the other thing can only get 20MB. :)

That's perfectly reasonable behavior on a game console, or on some special-purpose systems (like avionics). For a general-purpose desktop, it's nice to virtualize texture memory. That way everything continues to run (though it may be slow), just like with ordinary user processes.

OpenGL certainly needs to give apps more control over memory management. There have been some proposals and extensions for that in the past, and the GL2 working-group is planning new ones for the future.

Allen |
From: magenta <ma...@tr...> - 2003-01-17 21:13:05
|
On Fri, Jan 17, 2003 at 11:26:02AM -0800, Allen Akin wrote:
> On Thu, Jan 16, 2003 at 11:42:31PM -0800, magenta wrote:
> | I'd personally take the school of thought that if the user is running
> | a game which takes up 60MB of texture memory and then tries to
> | concurrently launch something which takes up another 60MB of texture
> | memory, it's their own fault that the other thing can only get 20MB. :)
>
> That's perfectly reasonable behavior on a game console, or on some
> special-purpose systems (like avionics). For a general-purpose desktop,
> it's nice to virtualize texture memory. That way everything continues
> to run (though it may be slow), just like with ordinary user processes.

Good point. I hadn't thought of the case of a GL-composited desktop environment like Jens had posted about, for example... My intent was that the applications would still *run*, they just wouldn't have much texture memory available. Though yeah, being able to page out other applications' video memory allocations would definitely be a good thing.

> OpenGL certainly needs to give apps more control over memory
> management. There have been some proposals and extensions for that in
> the past, and the GL2 working-group is planning new ones for the
> future.

--
http://trikuare.cx |
From: Ian R. <id...@us...> - 2003-01-17 15:01:14
|
magenta wrote:
> On Thu, Jan 16, 2003 at 05:33:42PM -0800, Ian Romanick wrote:
>
>> 1. In a scheme like this, how could processes be forced to update the
>>    can-swap bits on blocks that they own?
>
> Should it even be possible for one process to swap out other processes'
> context data? Alternatively (forgive me if this sounds a bit naive),
> could the swapping be handled by agpgart, which just changes the memory
> mapping of the allocated pages in the background? Sort of an added VM
> layer, only it would swap to system memory (which could then be swapped
> to disk)...

Changing which physical pages back the AGP mapping would help, but you have to remember that the memory manager also manages on-card memory. If back-buffers and depth-buffers are managed the same way, you could imagine that an application could use all of the on-card memory and prevent another context from being able to allocate a back-buffer.

>> 2. What is the best way for processes to be notified of events that
>>    could cause can-swap bits to change (i.e., rendering completion,
>>    asynchronous buffer-swap completion, etc.)? Signals from the
>>    kernel? Polling "age" variables?
>
> I'd lean towards signals, myself, though then that leads to possible
> problems with libGL using a signal which an application wants to use...
> Or would it be capable of defining new signals? (I'm not up to speed on
> how that part of the kernel works. Would it just be as simple as adding
> a new value to an enumeration?)

This is a problem that I ran into very quickly when I started thinking about adding support for asynchronous buffer-swaps. I think we'd have to do something with real-time signals, but my brain refuses to remember how all that works.

>> 4. How could the memory manager handle objects that span multiple
>>    blocks? In other words, could the memory manager be made to prefer
>>    to swap-out blocks that wholly contain all of the objects that
>>    overlap the block? Are there other useful metrics? Prefer to
>>    swap-out blocks that are half full over blocks that are completely
>>    full?
>
> If the AGP layer were to treat it like a VM layer and the page size were
> small (say, 4K) I don't think this would be an issue.

That may not be possible. Right now the blocks are tracked in the SAREA, and that puts an upper limit on the number of blocks available. On a 64MB memory region, the current memory manager ends up with 64KB blocks, IIRC. As memories get bigger (both on-card and AGP apertures), the blocks will get bigger. Also right now each block only requires 4 bytes in the SAREA. Any changes that would be made for a new memory manager would make each block require more space, thereby reducing the number of blocks that could fit in the SAREA.

Even if we increase the size of the SAREA, a system with 128MB of on-card memory and 128MB AGP aperture would require ~65000 blocks (if each block covered 4KB). |
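The figures above are easy to double-check with a line of arithmetic. A tiny helper (hypothetical, nothing DRI-specific) reproduces both the 1024 blocks of the current 64MB-region / 64KB-block scheme and the ~65000 blocks needed for 256MB of managed memory at 4KB granularity:

```c
#include <assert.h>

/* Number of blocks needed to cover mem_mb megabytes of managed memory
 * when each block covers block_kb kilobytes. */
static unsigned blocks_needed(unsigned mem_mb, unsigned block_kb)
{
    return (mem_mb * 1024u) / block_kb;
}
```

At 8 bytes of SAREA state per block (as in Jeff's later proposal), 65536 blocks would already need 512KB of shared area, which is why the block granularity can't simply shrink to the 4K page size.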
From: Jeff H. <jha...@ad...> - 2003-01-17 19:01:09
|
Ian, I've looked through your general proposal and it looks really good. Here are some implementation things I've been thinking about.

> That may not be possible. Right now the blocks are tracked in the
> SAREA, and that puts an upper limit on the number of blocks available.
> On a 64MB memory region, the current memory manager ends up with 64KB
> blocks, IIRC. As memories get bigger (both on-card and AGP apertures),
> the blocks will get bigger. Also right now each block only requires 4
> bytes in the SAREA. Any changes that would be made for a new memory
> manager would make each block require more space, thereby reducing the
> number of blocks that could fit in the SAREA.
>
> Even if we increase the size of the SAREA, a system with 128MB of
> on-card memory and 128MB AGP aperture would require ~65000 blocks (if
> each block covered 4KB).

Don't worry too much about this, we can create an entirely new SAREA to hold the memory manager. It can also be rather large; I'm thinking 128KB or so wouldn't be a problem at all. This will be non-swappable memory, but that's not too big a deal.

Here is what I'm thinking of as the general block format right now; it might not be perfect:

#define BLOCK_CAN_SWAP         (1<<0)
#define BLOCK_LINKS_TO_NEXT    (1<<1)
#define BLOCK_CAN_BE_CLOBBERED (1<<2)
#define BLOCK_IS_CACHABLE      (1<<3)

#define BLOCK_LOG2_USAGE_MASK  ((1<<4)|(1<<5)|(1<<6)|(1<<7))
#define BLOCK_LOG2_USAGE_SHIFT (4)
#define GET_BLOCK_LOG2_USAGE(status) \
        ((((status) & BLOCK_LOG2_USAGE_MASK) >> BLOCK_LOG2_USAGE_SHIFT) + 1)
#define PACK_BLOCK_LOG2_USAGE(log2) \
        ((((log2) - 1) << BLOCK_LOG2_USAGE_SHIFT) & BLOCK_LOG2_USAGE_MASK)

#define BLOCK_ID_SHIFT 8
#define BLOCK_ID_MASK  (((1<<20) - 1) << BLOCK_ID_SHIFT)   /* bits 27:8 */
#define PACK_BLOCK_ID(x) (((x) << BLOCK_ID_SHIFT) & BLOCK_ID_MASK)

struct memory_block {
        u32 age_variable;
        u32 status;
};

Where the age variable is device dependent, but I would imagine in most cases is a monotonically increasing unsigned 32-bit number. There needs to be a device driver function to check if an age has happened on the hardware.

The status variable has some room; only the bottom 28 bits are defined at the moment. The first 4 bits are status bits. If BLOCK_CAN_SWAP is set, we can swap this block; swapping requires the driver to call the kernel to swap out this block using some agp method where the contents are preserved. Can be accomplished by card DMA. If BLOCK_LINKS_TO_NEXT is set we are part of a group of blocks, which must be treated as a unit. If BLOCK_CAN_BE_CLOBBERED is set, the driver can just overwrite this block of memory. If BLOCK_IS_CACHABLE is set we can read back from this block in a fast way, so fallbacks can directly use this block.

The BLOCK_LOG2 stuff is a way to pack the usage of this block of memory in just a few bits. We pack log2 - 1, where we only accept usages of 2 bytes or more. Using 2 bytes could be considered empty. We can store block usage sizes of up to 64KB in this manner. I think that we want 64KB to be our maximum size for a block.

The bits 27:8 would be a 20-bit number representing a block id. Each one would be unique, so the driver could keep track of what blocks represent a texture. A 20-bit number should be sufficient, since that gives us about a million values to work with.

This is a pretty good start for a block format, I think. We want to make the memory management SAREA have a lock of its own; it shouldn't be a big deal to extend the drm to provide us with one. Or perhaps we use the normal device lock when we do any management; I haven't decided yet. There are some issues to really think about here.

This sort of implementation needs the kernel to be able to swap out a block from agp memory. The kernel should reserve a portion of the agp aperture for this purpose, probably on the order of 2-4 MB. Each allocation of the agp aperture should be no smaller than 1MB in size, to prevent agpgart from having to deal with too many blocks of memory. It will also have to be no smaller than the agp_page_shift, in case someone is using 4MB agp pages. The kernel will blit, with a card-specified function, the designated block from its current position to its final position in the block of agp memory to be swapped. When the ENTIRE block is full, the kernel will call agpgart to swap that region out of the agp aperture. The kernel will keep track of what each swapped-out block contains in some manner, or might brute-force scan the shared memory area containing the swapped-out blocks.

There will be a non-backed shared memory area that contains all the swapped-out pages; the "swapped pool" is probably a good name for it. Basically it's a shared memory area, of say 1MB in size, that doesn't have any pages backing it. It will have a kernel no-page function that populates it if needed. Basically it will only have information in it if things are swapped out of the aperture.

There needs to be a kernel function which moves a block of memory into cacheable space. We could do this with PCI DMA, or some magic conversion of unbound agp pages. This could be made safe, and wouldn't be a big deal with the new agpgart vm stuff. That way the block of agp memory could be accessed by a fallback or some other function that needs to directly read the texture. Readback from normal agp memory is horrible, something on the order of 60MB/sec.

Those are my implementation thoughts, pretty much a rehash of some of the things I wrote about the subject while at VA Linux. Feel free to poke holes through everything and make recommendations on design. I think this sort of direction should do what we need, but might need plenty of revision.

Cheers,
-Jeff |
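Jeff's status-word layout (4 status bits, 4 log2-usage bits, a 20-bit block id in bits 27:8) can be sanity-checked with a small round-trip test. This is just his encoding written out compilably; GET_BLOCK_ID is added here for symmetry and is not part of his proposal:

```c
#include <stdint.h>

typedef uint32_t u32;

#define BLOCK_CAN_SWAP         (1u << 0)
#define BLOCK_LINKS_TO_NEXT    (1u << 1)
#define BLOCK_CAN_BE_CLOBBERED (1u << 2)
#define BLOCK_IS_CACHABLE      (1u << 3)

/* Usage packed as log2(size) - 1, so 4 bits cover sizes 2 bytes..64KB. */
#define BLOCK_LOG2_USAGE_SHIFT 4
#define BLOCK_LOG2_USAGE_MASK  (0xfu << BLOCK_LOG2_USAGE_SHIFT)
#define GET_BLOCK_LOG2_USAGE(s) \
        ((((s) & BLOCK_LOG2_USAGE_MASK) >> BLOCK_LOG2_USAGE_SHIFT) + 1)
#define PACK_BLOCK_LOG2_USAGE(l) \
        ((u32)((l) - 1) << BLOCK_LOG2_USAGE_SHIFT)

/* 20-bit block id in bits 27:8. */
#define BLOCK_ID_SHIFT 8
#define BLOCK_ID_MASK  (0xfffffu << BLOCK_ID_SHIFT)
#define PACK_BLOCK_ID(x) (((u32)(x) << BLOCK_ID_SHIFT) & BLOCK_ID_MASK)
#define GET_BLOCK_ID(s)  (((s) & BLOCK_ID_MASK) >> BLOCK_ID_SHIFT)

struct memory_block {
    u32 age_variable;   /* device-dependent, monotonically increasing */
    u32 status;
};
```

The round trip confirms the fields don't overlap: a fully used 64KB block packs log2 usage 16 into the 4-bit field with room for the flags and id.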
From: Ian R. <id...@us...> - 2003-01-17 20:10:08
|
Jeff Hartmann wrote:
>> That may not be possible. Right now the blocks are tracked in the
>> SAREA, and that puts an upper limit on the number of blocks available.
>> On a 64MB memory region, the current memory manager ends up with 64KB
>> blocks, IIRC. As memories get bigger (both on-card and AGP apertures),
>> the blocks will get bigger. Also right now each block only requires 4
>> bytes in the SAREA. Any changes that would be made for a new memory
>> manager would make each block require more space, thereby reducing the
>> number of blocks that could fit in the SAREA.
>>
>> Even if we increase the size of the SAREA, a system with 128MB of
>> on-card memory and 128MB AGP aperture would require ~65000 blocks (if
>> each block covered 4KB).
>
> Don't worry too much about this, we can create an entirely new SAREA to
> hold the memory manager. It can also be rather large; I'm thinking
> 128KB or so wouldn't be a problem at all. This will be non-swappable
> memory, but that's not too big a deal. Here is what I'm thinking of as
> the general block format right now; it might not be perfect:

That works. It should also be possible to have it vary its size depending on the amount of memory to be managed.

[code segment snipped]

> struct memory_block {
>     u32 age_variable;
>     u32 status;
> };
>
> Where the age variable is device dependent, but I would imagine in most
> cases is a monotonically increasing unsigned 32-bit number. There needs
> to be a device driver function to check if an age has happened on the
> hardware.

I don't think having an age variable in the shared area is necessary or sufficient. That's what my original can-swap bit was all about. Each item that is in a block would have its own age variable / fence. When all of the age variable / fence conditions were satisfied, the can-swap bit would be set.

> The status variable has some room; only the bottom 28 bits are defined
> at the moment. The first 4 bits are some status bits. If BLOCK_CAN_SWAP
> is set, we can swap this block; swapping requires the driver to call
> the kernel to swap out this block using some agp method where the
> contents are preserved. Can be accomplished by card DMA. If
> BLOCK_LINKS_TO_NEXT is set we are part of a group of blocks, which must
> be treated as a unit. If BLOCK_CAN_BE_CLOBBERED is set, the driver can
> just overwrite this block of memory. If BLOCK_IS_CACHABLE is set we can
> read back from this block in a fast way, so fallbacks can directly use
> this block.

That's interesting. I hadn't considered having kernel intervention to actually page out blocks. I had always been under the assumption that all blocks in AGP or on-card memory were either locked or throw-away.

Just like with regular virtual memory, I think we only need to "page out" pages that we're going to use. I don't think we should need to page out an entire set of linked pages. Initially we may want to, though. It wouldn't help much with on-card memory, but with AGP memory (where we can change mappings), we should be able to do some tricks to avoid having to do full re-loads. It's also possible that only a subset of the blocks belonging to an object will have been modified. Perhaps what we really need to know for each block is:

1. Is the block modified (i.e., by glCopyTexImage)?
2. What pages in system memory back the block? That is, where are the parts of the texture in system memory that represent the block in AGP / on-card memory?

Hmm...starts to feel like a regular virtual memory system...

> The BLOCK_LOG2 stuff is a way to pack the usage of this block of memory
> in just a few bits. We pack log2 - 1, where we only accept usages of 2
> bytes or more. Using 2 bytes could be considered empty. We can store
> block usage sizes of up to 64KB in this manner. I think that we want
> 64KB to be our maximum size for a block.

That's probably finer granularity than we need. We could probably get away with "empty", "mostly empty", "half full", "mostly full", and "full". Admittedly, that only saves one bit, but it removes the 64KB limit.

One thing this is missing is some way to prioritize which blocks are to be swapped out. Right now the blocks are stored in an LRU linked list, but I don't think that (the explicit linked list) is necessarily the best way to go.

> The bits 27:8 would be a 20-bit number representing a block id. Each
> one would be unique, so the driver could keep track of what blocks
> represent a texture. A 20-bit number should be sufficient.
>
> This is a pretty good start for a block format I think. We want to
> make the memory management SAREA have a lock of its own; shouldn't be a
> big deal to extend the drm to provide us with one. Or perhaps we use
> the normal device lock when we do any management, I haven't decided
> yet. There are some issues to really think about here.
>
> This sort of implementation needs the kernel to be able to swap out a
> block from agp memory. The kernel should reserve a portion of the agp
> aperture for this purpose, probably on the order of 2-4 MB. Each
> allocation of the agp aperture should be no smaller than 1MB in size,
> to prevent agpgart from having to deal with too many blocks of memory.
> It will also have to be no smaller than the agp_page_shift, in case
> someone is using 4MB agp pages. The kernel will blit, with a
> card-specified function, the designated block from its current position
> to its final position in the block of agp memory to be swapped. When
> the ENTIRE block is full, the kernel will call agpgart to swap that
> region out of the agp aperture. The kernel will keep track of what
> each swapped-out block contains in some manner, or might brute-force
> scan the shared memory area containing the swapped-out blocks.

Okay. There's a few details of this that I'm not seeing. I'm sure they're there, I'm just not seeing them. Process A needs to allocate some blocks (or even just a single block) for a texture. It scans the list of blocks and finds that not enough free blocks are available. It performs some hocus-pocus and determines that a block "owned" by process B needs to be freed. That block has the BLOCK_CAN_SWAP bit set, but the BLOCK_CAN_BE_CLOBBERED bit is cleared. Process A asks the kernel to page the block out. Then what? How does process B find out that its block was stolen and page it back in?

> There will be a non-backed shared memory area that contains all the
> swapped-out pages; the "swapped pool" is probably a good name for it.
> Basically it's a shared memory area, of say 1MB in size, that doesn't
> have any pages backing it. It will have a kernel no-page function that
> populates it if needed. Basically it will only have information in it
> if things are swapped out of the aperture.
>
> There needs to be a kernel function which moves a block of memory into
> cacheable space. We could do this with PCI DMA, or some magic
> conversion of unbound agp pages. This could be made safe, and wouldn't
> be a big deal with the new agpgart vm stuff. That way the block of agp
> memory could be accessed by a fallback or some other function that
> needs to directly read the texture. Readback from normal agp memory is
> horrible, something on the order of 60MB/sec.

The conversion would probably be better. It would also play nice with ARB_vertex_array_objects.

Also, how does this all work without AGP? There still are a fair number of PCI cards out there. :)

A lot of this is also very Linux specific. What can we do to make as much of this as possible OS independent? I don't think our BSD friends will be very happy if we leave them in the cold. :) Linux is most people's first priority, but it's not the /only/ priority... |
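Ian's per-object fence refinement above might look something like the following sketch. All names are invented; the device-specific "has this age happened" driver hook is represented by a plain current-age parameter, and counter wraparound is ignored here for simplicity:

```c
#include <stdint.h>

#define MAX_OBJS 8

struct gl_object {
    uint32_t fence_age;   /* hardware age at which the last use completes */
};

struct block_state {
    struct gl_object *objs[MAX_OBJS];
    int nobjs;
    int can_swap;         /* the derived bit that goes in the shared area */
};

/* Recompute the shared-area can-swap bit from the per-object fences:
 * it is set only once every fence condition has been satisfied. */
static void block_update_can_swap(struct block_state *b, uint32_t current_age)
{
    b->can_swap = 1;
    for (int i = 0; i < b->nobjs; i++) {
        if (b->objs[i]->fence_age > current_age) {  /* fence not yet passed */
            b->can_swap = 0;
            return;
        }
    }
}
```

The point of deriving the bit rather than storing a single block age is that two objects in one block can have independent outstanding fences; the block becomes swappable only when the latest of them retires.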
From: magenta <ma...@tr...> - 2003-01-17 21:25:53
|
On Fri, Jan 17, 2003 at 12:09:58PM -0800, Ian Romanick wrote: > > > struct memory_block { > > u32 age_variable; > > u32 status; > > }; > > > > Where the age variable is device dependant, but I would imagine in most > > cases is a monotonically increasing unsigned 32-bit number. There needs to > > be a device driver function to check if an age has happened on the hardware. > > I don't think having an age variable in the shared area is necessary or > sufficient. That's what my original can-swap bit was all about. Each > item that is in a block would have its own age variable / fence. When > all of the age variable / fence conditions were satisfied, the can-swap > bit would be set. Also, using an age variable leads to lots of other really difficult issues, like what happens when it wraps around. A clock algorithm (as it was called in my undergraduate courses, anyway) for paging out would probably be better. > > The BLOCK_LOG2 stuff is > > a way to pack the usage of this block of memory in just a few bits. We pack > > log2 - 1, where we only accept usages of 2 bytes or more. Using 2 bytes > > could be considered empty. We can store upto block usage sizes of 64k in > > this manner. I think that we want 64kb to be our maximum size for a block. > > That's probably finer granularity than we need. We could probably get > away with "empty", "mostly empty", "half full", "mostly full", and > "full". Admittedly, that only saves one bit, but it removes the 64KB limit. > > One thing this is missing is some way to prioritize which blocks are to > be swapped out. Right now the blocks are stored in a LRU linked list, > but I don't think that's necessarilly the best way (the explicit linked > list) to go. ><snip> > A lot of this is also very Linux specific. What can we do to make as > much of this as possible OS independent? I don't think our BSD friends > will be very happy if we leave them in the cold. :) Linux is most > people's first priority, but it's not the /only/ priority... 
So having the kernel do it probably isn't the best way. :) -- http://trikuare.cx |
From: Ian R. <id...@us...> - 2003-01-18 01:11:48
|
magenta wrote: > On Fri, Jan 17, 2003 at 12:09:58PM -0800, Ian Romanick wrote: > >>>struct memory_block { >>> u32 age_variable; >>> u32 status; >>>}; >>> >>> Where the age variable is device dependent, but I would imagine in most >>>cases is a monotonically increasing unsigned 32-bit number. There needs to >>>be a device driver function to check if an age has happened on the hardware. >> >>I don't think having an age variable in the shared area is necessary or >>sufficient. That's what my original can-swap bit was all about. Each >>item that is in a block would have its own age variable / fence. When >>all of the age variable / fence conditions were satisfied, the can-swap >>bit would be set. > > > Also, using an age variable leads to lots of other really difficult issues, > like what happens when it wraps around. A clock algorithm (as it was > called in my undergraduate courses, anyway) for paging out would probably > be better. I think we're running into some terminology problems here. In the existing memory manager, the age is a "when was it last used" variable. In the new proposals, the age is a fence. There are still wrap-around issues. :( [snip] >>A lot of this is also very Linux specific. What can we do to make as >>much of this as possible OS independent? I don't think our BSD friends >>will be very happy if we leave them in the cold. :) Linux is most >>people's first priority, but it's not the /only/ priority... > > > So having the kernel do it probably isn't the best way. :) Putting some stuff in the kernel is fine as long as we don't rely on exotic, Linux-specific in-kernel interfaces. Putting too heavy a reliance on the new Linux AGPGART or on specifics of the Linux VM is likely to get us in trouble. It will also make it more difficult to port to other systems. This might be a good time to look at what some of our kernel issues are. I seem to remember a thread about porting the DRM to Solaris that ultimately led to despair. :( |
From: Jeff H. <jha...@ad...> - 2003-01-18 02:16:27
|
> -----Original Message----- > From: dri...@li... > [mailto:dri...@li...]On Behalf Of Ian Romanick > Sent: Friday, January 17, 2003 7:12 PM > To: DRI developer's list > Subject: Re: [Dri-devel] The next round of texture memory management... > > [snip] > > Also, using an age variable leads to lots of other really > difficult issues, > > like what happens when it wraps around. A clock algorithm (as it was > > called in my undergraduate courses, anyway) for paging out > would probably > > be better. > > I think we're running into some terminology problems here. In the > existing memory manager, the age is a "when was it last used" variable. > In the new proposals, the age is a fence. There are still wrap-around > issues. :( Actually if it is a straight monotonically increasing unsigned 32-bit counter we can do the signed comparison: (s32)current_age_counter - (s32)buffer_age < 0 Just like the code in the linux kernel that does signed compares to deal with timer wraps. [snip] > Putting some stuff in the kernel is fine as long as we don't rely on > exotic, Linux-specific in-kernel interfaces. Putting too heavy a > reliance on the new Linux AGPGART or on specifics of the Linux VM is > likely to get us in trouble. > > It will also make it more difficult to port to other systems. This > might be a good time to look at what some of our kernel issues are. I > seem to remember a thread about porting the DRM to Solaris that > ultimately led to despair. :( It is always a balance, putting things in the kernel and userspace. I'm fairly confident we will need some kernel support for this project, if we want to achieve our goal. I think we will try and keep the requirements on the kernel well-defined though, and not too exotic. Perhaps, too, we will allow some/most of the benefits of the code to run on some operating systems, while we get full usage only on systems that support certain features we need. 
As the system gets designed, I suppose we will just have to try and keep these issues in mind. -Jeff |
From: magenta <ma...@tr...> - 2003-01-18 02:49:25
|
On Fri, Jan 17, 2003 at 08:13:05PM -0600, Jeff Hartmann wrote: > > > > -----Original Message----- > > From: dri...@li... > > [mailto:dri...@li...]On Behalf Of Ian Romanick > > Sent: Friday, January 17, 2003 7:12 PM > > To: DRI developer's list > > Subject: Re: [Dri-devel] The next round of texture memory management... > > > > > [snip] > > > > Also, using an age variable leads to lots of other really > > difficult issues, > > > like what happens when it wraps around. A clock algorithm (as it was > > > called in my undergraduate courses, anyway) for paging out > > would probably > > > be better. > > > > I think we're running into some terminology problems here. In the > > existing memory manager, the age is a "when was it last used" variable. > > In the new proposals, the age is a fence. There are still wrap-around > > issues. :( > > Actually if it is a straight monotonically increasing unsigned 32-bit > counter we can do the signed comparision: > (s32)current_age_counter - (s32)buffer_age < 0 Assuming 'current age counter' is a time value for 'expire everything older than this,' then yes, that works, as long as you have fewer than 2^32 objects in the memory pool (which I don't see as a problem in the forseeable future). :) I thought that the age counter was just going to be something for finding a minimum value for LRU. But if you use a wraparound counter based on, say, the number of events which have occurred, then you'd might as well just use the clock algorithm instead, and require only one bit for the 'purge_okay' flag. It'll have the same results. > Just like the code in the linux kernel that does signed compares to deal > with timer wraps. As long as you're only comparing timer events which have happened within the past 2^32 clock ticks, sure. > [snip] > > Putting some stuff in the kernel is fine as long as we don't rely on > > exotic, Linux-specific in-kernel interfaces. 
Putting too heavy of a > > reliance on the new Linux AGPGART or on specifics of the Linux VM are > > likely to get us in trouble. > > > > It will also make it more difficult to port to other systems. This > > might be a good time to look at what some of our kernel issues are. I > > seem to remember a thread about porting the DRM to Solaris that > > ultimately led to despair. :( > > It is always a balance, putting things in the kernel and userspace. I'm > fairly confident we will need some kernel support for this project, if we > want to acheive our goal. I think we will try and keep the requirements on > the kernel well defined though, and not too exotic. Perhaps too we will > allow some/most of the benefits of the code to run on some operating > systems, while we get full usage only on systems that support certain > features we need. As the system gets designed I suppose we will just have > to try and keep these issues in mind I guess. I'm a big fan of abstraction layers, myself... why not define an abstraction layer which provides the various functionality needed by the memory manager, and then put OS-specific stuff into the implementation? Surely the model won't be so different between any two OSes that it can't be boiled down to a single common set of higher-level functions... Or is that me being too naive again? (I'm a graphics programmer, not a kernel programmer, and I know just enough about systems level stuff to be dangerous. :) -- http://trikuare.cx |
From: Ian R. <id...@us...> - 2003-01-20 17:24:07
|
magenta wrote: > On Fri, Jan 17, 2003 at 08:13:05PM -0600, Jeff Hartmann wrote: >>>I think we're running into some terminology problems here. In the >>>existing memory manager, the age is a "when was it last used" variable. >>> In the new proposals, the age is a fence. There are still wrap-around >>>issues. :( >> >>Actually if it is a straight monotonically increasing unsigned 32-bit >>counter we can do the signed comparison: >>(s32)current_age_counter - (s32)buffer_age < 0 > > Assuming 'current age counter' is a time value for 'expire everything older > than this,' then yes, that works, as long as you have fewer than 2^32 > objects in the memory pool (which I don't see as a problem in the > foreseeable future). :) I thought that the age counter was just going to be > something for finding a minimum value for LRU. But if you use a wraparound > counter based on, say, the number of events which have occurred, then you > might as well just use the clock algorithm instead, and require only one > bit for the 'purge_okay' flag. It'll have the same results. There is one subtle, but important, difference. The can-swap bit needs to be updated when the fence is completed. If the fence value is stored in the block when the fence is set, no further update is required. This was one of the problems in my original post. If only a can-swap bit is used, how do you prevent clients from forgetting to set the bit when it can be set? I'm still not 100% convinced that it is the best solution, but it is a solution. :) [snip] >>It is always a balance, putting things in the kernel and userspace. I'm >>fairly confident we will need some kernel support for this project, if we >>want to achieve our goal. I think we will try and keep the requirements on >>the kernel well-defined though, and not too exotic. Perhaps, too, we will >>allow some/most of the benefits of the code to run on some operating >>systems, while we get full usage only on systems that support certain >>features we need. 
As the system gets designed I suppose we will just have >>to try and keep these issues in mind I guess. > > I'm a big fan of abstraction layers, myself... why not define an > abstraction layer which provides the various functionality needed by the > memory manager, and then put OS-specific stuff into the implementation? > Surely the model won't be so different between any two OSes that it can't > be boiled down to a single common set of higher-level functions... Or is > that me being too naive again? (I'm a graphics programmer, not a kernel > programmer, and I know just enough about systems level stuff to be > dangerous. :) This is the way most of the DRM currently works. I suspect that it will be possible to continue doing so. |
From: Jeff H. <jha...@ad...> - 2003-01-17 23:35:15
|
> -----Original Message----- > From: dri...@li... > [mailto:dri...@li...]On Behalf Of Ian Romanick > Sent: Friday, January 17, 2003 2:10 PM > To: DRI developer's list > Subject: Re: [Dri-devel] The next round of texture memory management... > > > Jeff Hartmann wrote: > > >>That may not be possible. Right now the blocks are tracked in the > >>SAREA, and that puts an upper limit on the number of block available. > >>On a 64MB memory region, the current memory manager ends up with 64KB > >>blocks, IIRC. As memories get bigger (both on-card and AGP apertures), > >>the blocks will get bigger. Also right now each block only requires 4 > >>bytes in the SAREA. Any changes that would be made for a new memory > >>manager would make each block require more space, thereby reducing the > >>number of blocks that could fit in the SAREA. > >> > >>Even if we increase the size of the SAREA, a system with 128MB of > >>on-card memory and 128MB AGP aperture would require ~65000 blocks (if > >>each block covered 4KB). > > > > Don't worry too much about this, we can create an entirely > new SAREA to > > hold the memory manager. It can also be rather large, I'm > thinking about > > 128KB or so wouldn't be a problem at all. This will be non swappable > > memory, but thats not too big a deal. Here is what I'm > thinking of as the > > general block format right now, it might not be perfect: > > That works. It should also be possible to have it vary its size > depending on the amount of memory to be managed. Yeah that shouldn't be too difficult to accomplish. > > [code segment snipped] > > > struct memory_block { > > u32 age_variable; > > u32 status; > > }; > > > > Where the age variable is device dependant, but I would > imagine in most > > cases is a monotonically increasing unsigned 32-bit number. > There needs to > > be a device driver function to check if an age has happened on > the hardware. > > I don't think having an age variable in the shared area is necessary or > sufficient. 
That's what my original can-swap bit was all about. Each > item that is in a block would have its own age variable / fence. When > all of the age variable / fence conditions were satisfied, the can-swap > bit would be set. Actually I think it is the best way: all you do is put the "greatest" or "latest" age variable in the block description. That way we are only done when the last thing is fenced. Makes swap decisions a HELL of a lot easier; that way we don't have to have any nasty signal code and age lists all over the place. > > > The status variable has some room, only the bottom 28-bits > are defined at > > the moment. The first 4 bits are some status bits. If > BLOCK_CAN_SWAP is > > set, we can swap this block, swapping requires the driver to > call the kernel > > to swap out this block using some agp method where the contents are > > preserved. Can be accomplished by card DMA. If > BLOCK_LINKS_TO_NEXT is set > > we are part of a group of blocks, which must be treated as a unit. If > > BLOCK_CAN_BE_CLOBBERED is set, the driver can just overwrite > this block of > > memory. If BLOCK_IS_CACHABLE is set we can readback from this > block in a > > fast way, so fallbacks can directly use this block. > > That's interesting. I hadn't considered having kernel intervention to > actually page out blocks. I had always been on the assumption that all > blocks in AGP or on-card memory were either locked or throw-away. Yeah, that's a big important thing here: having some of the operations happen in the kernel allows you to do some really nice things. My main concern is having the logic outside of the kernel; the kernel does some things better than anyone else, and can do things other people can't. As long as the kernel doesn't have to make the decisions and keep around enough information to make the proper decisions I'm happy with the implementation. > > Just like with regular virtual memory, I think we only need to "page > out" pages that we're going to use. 
I don't think we should need to > page out an entire set of linked pages. Initially we may want to, > though. It wouldn't help much with on-card memory, but with AGP memory > (where we can change mappings), we should be able to do some tricks to > avoid having to do full re-loads. It's also possible that only a subset > of the blocks belonging to an object will have been modified. > > Perhaps what we really need to know for each block is: > > 1. Is the block modified (i.e., by glCopyTexImage)? > 2. What pages in system memory back the block? That is, where are the > parts of the texture in system memory that represent the block in AGP / > on-card memory? Now this is too much information I think. We may want to store which agp key references blocks in some sort of separate way, but I don't know how useful that information would be... I have to do some thinking here. Here is my thoughts about things: 1. We are a particular page inside a particular address space. We only know we are page #n in that address space. We don't care about anything else, our page number is our offset. We would have a card pool and an agp pool. We also have a swapped out pool, but things here probably can't be directly accessed. We need kernel intervention to allow us to access these swapped out things. If the address space is segmented into several mappings we need an address mapping function in the client side 3D driver. Not terrible difficult and we don't have to store too much information. 2. We consider the block or group of blocks as an entire "unit", everything is done on units, not individual pieces of the blocks. That prevents people swapping out the first page of a group of textures and someone having to wait for just that block to come back. 3. Only large agp allocations are swapped out at one time. Little blocks are blitted into a 1MB region, when it is full and the blits are committed, we can decide to swap them out of the agp aperture. 
This avoids lots of small pages being swapped out thrashing with lots of agp gatt table accesses and potentially causing nasty things like cpu cache flushes too much. 4. Implementations without agp will require at least PCI DMA, or a slow function that copies over the bus. They will have only one pool, and will be considered "swapped" when they aren't in the card pool. If the card supports some sort of PCI-GART we could treat it similarly to agp memory. 5. It might be useful to know some metrics about what kind of memory we are, backbuffer, texture, etc. I'm not sure if we really need to know this information, but it could be useful. > > Hmm...starts to fell like a regular virtual memory system... Hehe, thats about the long and the short of it... > > > The BLOCK_LOG2 stuff is > > a way to pack the usage of this block of memory in just a few > bits. We pack > > log2 - 1, where we only accept usages of 2 bytes or more. Using 2 bytes > > could be considered empty. We can store upto block usage sizes > of 64k in > > this manner. I think that we want 64kb to be our maximum size > for a block. > > That's probably finer granularity than we need. We could probably get > away with "empty", "mostly empty", "half full", "mostly full", and > "full". Admittedly, that only saves one bit, but it removes the > 64KB limit. Sounds okay. > > One thing this is missing is some way to prioritize which blocks are to > be swapped out. Right now the blocks are stored in a LRU linked list, > but I don't think that's necessarilly the best way (the explicit linked > list) to go. Selection might happen LRU or not. The reason for making the age variable public though is so we could perhaps weigh using it. We could also weigh decisions on memory type if we encode that information. Keeping memory type information around also allows us to make private->shared backbuffer / depthbuffer decisions easier and without as much or perhaps any client intervention. 
There needs to be a selection of which pages to grab next if going in a linear fashion in a client's address space fails. Perhaps it jumps by a preset limit (the normal address space each client carves out for itself) and tries again. Perhaps if something like that fails we fall back to linear age-based scanning. Perhaps we keep some sort of freelist based on regions. We can encode a region of 256 megs by page offset and number of 4k pages in a single 32-bit number. Change the page size a little bit and the requirement of bits becomes smaller. I'm still thinking at this point and don't have the perfect data structure and logic worked out just yet for the freelist/usedlist part of the memory manager. I suppose, though, that's what these technical emails are for. Originally at VA I thought to just use something like the Utah-GLX memory manager for the freelist, or perhaps Keith's block memory manager, but just to extend it. I'm not so sure that this is the proper solution though. I guess writing down some of the attributes we want is in order, and trying to think through the problem: 1. We want a method to find out if our pages were messed with, and this should be fast and trivial. Hashing into a bit vector based on id seems like a good thing here. I describe this in detail later in the email when I answer one of your questions. 2. We want a fast method to find a free region if any exist. A free region should be randomly selected or selected with a weight towards the page(s) being close to an address range we specify. A queue, stack, or list comes to mind here. Lists have such poor performance sometimes though.... Should the lists be stored inside the data blocks? I don't think so, but it might be an implementation that makes sense. I tend to think the "allocation" structures could/should be separate from the pagelist. Here could be some possible freelist implementations: a. Each page is a bit in a bit vector. A set bit means the memory is in use. 
Find the first zero bit and look for zero bits of a certain number afterwards for a particular sized region. Would have really good performance in the normal case where we are not over committed, and could be made to index to a particular address space easily. There are drawbacks here though, we could potentially use alot of memory with all the bit vectors in our memory manager. Also this doesn't help us too much in the over committed case, we would need something to run through the pagelist and swap out by age in a linear fashion when things get over committed. Or perhaps in a random fashion, dunno. This could happen in a kernel thread or in the Xserver at regular intervals though, so we always attempt to have some room in the freelist. b. We go with a list, queue, or stack. 6 bytes for 256 megs / 4kb pages for a single link, or 8 bytes for 256 megs / 4 kb pages for a double link. We make the head of the list at initialization point to the whole region and we split much like the Utah implementation did. Could be LRU or MRU. Might be faster then previous method unless we want to weigh by address space, then it might get more complicated. Unfortunately this is only kept sorted on age, not where we are in the address space. It could have poor selection of free region performance and things might tend to be grouped and have some bad behavior. Perhaps we keep two separate lists, the allocated list and the free list. c. Something a little more exotic might have better performance. Perhaps keeping a binary tree as a front end to the region lists, that way allowing us to select quickly based on address space. Perhaps slice up the address space into big (say 4 MB) regions and have a list for each region. Perhaps hashing based on region size. I suppose the possibilities are endless here. While something like these ideas might work, I usually go back to the drawing board when I end up with too exotic a solution. Simple and elegant tends to work best in most situations. 3. 
We want an easy method to grow the memory backing an agp pool, but also some sort of per client restriction, perhaps just the system wide restrictions will do? This should be solved by the agp extension proposal I made earlier in the week. 4. We want a simple way to determine if an age allows us to do something to a texture which has the BLOCK_CAN_BE_CLOBBERED bit set, storing the last age used on a block is all that should be required I think. > > > The bits 27:8 would be a 20-bit number representing a block > id. Each one > > would be unique, so the driver could keep track of what blocks > represent a > > texture. A 20-bit number should be sufficient, since that gives > us like 2 > > million values to work with. > > > > This is a pretty good start for a block format I think. We > want to make > > the memory management SAREA have a lock of its own, shouldn't > be a big deal > > to extend the drm to provide us with one. Or perhaps we use the normal > > device lock when we do any management, I haven't decided yet. There are > > some issues to really think about here. > > > > This sort of implementation needs the kernel to be able to > swap out a block > > from agp memory. The kernel should reserve a portion of the > agp aperture > > for this purpose. Probably on the order of 2-4 MB. Each > allocation of the > > agp aperture should be no smaller then 1MB in size, to prevent > agpgart from > > having to deal with too many blocks of memory. It will also > have to be no > > smaller then the agp_page_shift, in case someone is using 4MB agp pages. > > The kernel will blit with a card specified function the designated block > > from its current position to its final position in the block of > agp memory > > to be swapped. When the ENTIRE block is full, then the kernel will call > > agpgart to swap that region out of the agp aperture. 
The > kernel will keep > track of what each swapped out block contains in some manner, > or might > brute-force scan the shared memory area containing the swapped out blocks. > > Okay. There are a few details of this that I'm not seeing. I'm sure > they're there, I'm just not seeing them. > > Process A needs to allocate some blocks (or even just a single block) > for a texture. It scans the list of blocks and finds that not enough > free blocks are available. It performs some hocus-pocus and determines > that a block "owned" by process B needs to be freed. That block has the > BLOCK_CAN_SWAP bit set, but the BLOCK_CAN_BE_CLOBBERED bit is cleared. > > Process A asks the kernel to page the block out. Then what? How does > process B find out that its block was stolen and page it back in? Okay, here is how I think things could happen: I want to page the block out, so I ask the kernel to return when this list of pages I give it has been swapped out and is available. If the kernel can immediately process this request, it does so and returns; if it has to do some dma, it puts the client on a wait queue to be woken up when it happens. The kernel goes ahead and updates the blocks in the SAREA, saying that they aren't there (marking their ids as zero, perhaps). Process B comes along and sees its textures aren't resident and needs them; it asks the kernel to make them resident somewhere, it doesn't care where. It passes some ids to the kernel and asks the kernel to make them resident. The kernel puts the process on a waitqueue or returns in a similar fashion to the first request. Whenever we get the lock with contention we must do some sort of quick scanning. We might want to speed up the process somehow, perhaps some sort of hashing by texture number to a dirty flag. Actually that is probably the best implementation. If we reserve 64k of address space to be our dirty flags (backed only when accessed) we can make the dirty flags a bit vector. 
Considering the texture or block id as an index into this vector we can rapidly find out if our list of textures has been "fooled" with. This prevents us from scanning the entire list, which could be slow. I should also point out that the id's will be reused. We will always attempt to use the smallest id available for use. This way using it as an index into a shared memory area isn't so bad. That way we avoid using lots of memory for nothing when we only have a few texture blocks. > > > There will be a non backed shared memory area that contains > all the swapped > > out pages, the swapped pool it probably a good thing to call > it. Basically > > its a shared memory area, of say 1MB in size that doesn't have any pages > > backing it. It will have a kernel no page function that populates it if > > needed. Basically it will only have information in it if > things are swapped > > out of the aperture. > > > > There needs to be a kernel function which moves a block of > memory into > > cacheable space. We could do with with PCI dma, or some magic > conversion of > > unbound agp pages. This could be made safe, and wouldn't be a > big deal with > > the new agpgart vm stuff. That way the block of agp memory could be > > accessed by a fallback or some other function that needs to > directly read > > the texture. Readback from normal agp memory is horrible, > something on the > > order of 60MB/sec. > > The conversion would probably be better. It would also play nice with > ARB_vertex_array_objects. Also I should point out, on some systems we have the nice ability to have cached agp memory. On these systems we need no conversion, or perhaps just moving the texture into a cachable memory block. On these systems it might even make sense to have all textures marked cachable, but that will take some experimentation. > > Also, how does this all work without AGP? There still are a fair number > of PCI cards out there. :) > > A lot of this is also very Linux specific. 
What can we do to make as > much of this as possible OS independent? I don't think our BSD friends > will be very happy if we leave them in the cold. :) Linux is most > people's first priority, but it's not the /only/ priority... While it is Linux specific, the modifications and improvements I make to agpgart to make this happen can be ported. I don't think it will require too much more then that, the functions that will plug into the kernel could all be portable much like the rest of the driver code is currently. Some nice additions to agpgart are all that is required to make this possible I think. As for using pci dma or simple copying of card memory to pci memory that would probably be directly portable without any or little effort. -Jeff |
From: Ian R. <id...@us...> - 2003-01-18 01:05:09
|
Jeff Hartmann wrote: >>>struct memory_block { >>> u32 age_variable; >>> u32 status; >>>}; >>> >>> Where the age variable is device dependant, but I would imagine in most >>>cases is a monotonically increasing unsigned 32-bit number. There needs to >>>be a device driver function to check if an age has happened on the hardware. >> >>I don't think having an age variable in the shared area is necessary or >>sufficient. That's what my original can-swap bit was all about. Each >>item that is in a block would have its own age variable / fence. When >>all of the age variable / fence conditions were satisfied, the can-swap >>bit would be set. > > Actually I think it is the best way, all you do is put the "greatest" or > "latest" age variable in the block description. That way only when we are > only done when the last thing is fenced. Makes swap decisions a HELL of > alot easier, that way we don't have to have any nasty signal code and age > lists all over the place. The potential problem is there are somethings that can't be tracked by a simple "age." The one thing I can think of is back-buffers. An application might have several buffer-swap operations that are blocked waiting for a certain vertical blank number. There could be other rendering operations that are sent after the buffer-swap that will complete BEFORE the blit for the buffer-swap is queued. I can't see a reasonable way to assign an age to those back-buffers. Since this is the only case I can think of, there may be a different way to handle it. [snip] >>Just like with regular virtual memory, I think we only need to "page >>out" pages that we're going to use. I don't think we should need to >>page out an entire set of linked pages. Initially we may want to, >>though. It wouldn't help much with on-card memory, but with AGP memory >>(where we can change mappings), we should be able to do some tricks to >>avoid having to do full re-loads. 
It's also possible that only a subset >>of the blocks belonging to an object will have been modified. >> >>Perhaps what we really need to know for each block is: >> >>1. Is the block modified (i.e., by glCopyTexImage)? >>2. What pages in system memory back the block? That is, where are the >>parts of the texture in system memory that represent the block in AGP / >>on-card memory? > > > Now this is too much information I think. We may want to store which agp > key references blocks in some sort of separate way, but I don't know how > useful that information would be... I have to do some thinking here. Here > are my thoughts about things: > > 1. We are a particular page inside a particular address space. We only know > we are page #n in that address space. We don't care about anything else, > our page number is our offset. We would have a card pool and an agp pool. > We also have a swapped out pool, but things here probably can't be directly > accessed. We need kernel intervention to allow us to access these swapped > out things. If the address space is segmented into several mappings we need > an address mapping function in the client side 3D driver. Not terribly > difficult and we don't have to store too much information. > > 2. We consider the block or group of blocks as an entire "unit", everything > is done on units, not individual pieces of the blocks. That prevents people > swapping out the first page of a group of textures and someone having to > wait for just that block to come back. I believe that the block should be the unit used. If each block has a group ID (the IDs that you talk about below) and a sequence number, we can do some very nice optimizations. Imagine a case where we have two textures that use 51% of the available texture space. Performance would DIE if we had to bring in the entire texture every single time. We can do a little optimization and only bring in 2% of texture memory each time instead of 102%. > 3. 
Only large agp allocations are swapped out at one time. Little blocks > are blitted into a 1MB region, when it is full and the blits are committed, > we can decide to swap them out of the agp aperture. This avoids lots of > small pages being swapped out thrashing with lots of agp gatt table accesses > and potentially causing nasty things like cpu cache flushes too much. > > 4. Implementations without agp will require at least PCI DMA, or a slow > function that copies over the bus. They will have only one pool, and will > be considered "swapped" when they aren't in the card pool. If the card > supports some sort of PCI-GART we could treat it similarly to agp memory. > > 5. It might be useful to know some metrics about what kind of memory we are, > backbuffer, texture, etc. I'm not sure if we really need to know this > information, but it could be useful. As we get a bit farther along we'll need to decide exactly what information we want to store with each block to help make swap-out decisions. We could let each process make that decision. With each block it stores three values. The first value is the cost of restoring a block. The second is the normalized probability that the block will be needed during the current frame, and the third is the probability that it will be needed in the next frame. Values of zero for the probability mean "I don't know." The cost value could probably be inferred from the status bits and the fullness value. With these values it becomes pretty simple for the kernel to select candidate blocks to reclaim. [snip] >>One thing this is missing is some way to prioritize which blocks are to >>be swapped out. Right now the blocks are stored in a LRU linked list, >>but I don't think that's necessarily the best way (the explicit linked >>list) to go. > > Selection might happen LRU or not. The reason for making the age variable > public though is so we could perhaps weigh using it. 
We could also weigh > decisions on memory type if we encode that information. Keeping memory type > information around also allows us to make private->shared backbuffer / > depthbuffer decisions easier and without as much or perhaps any client > intervention. > > There needs to be a selection of which pages to grab next if going in a > linear fashion in a client's address space fails. Perhaps it jumps by a > preset limit (the normal address space each client carves out for itself) > and tries again. Perhaps if something like that fails we fall back to > linear age based scanning. Perhaps we keep some sort of freelist based on > regions. We can encode a region of 256 megs by page offset and number of 4k > pages in a single 32-bit number. Change the page size a little bit and the > number of bits required becomes smaller. > > I'm thinking at this point and don't have the perfect data structure and > logic worked out just yet for the freelist/usedlist part of the memory > manager. I suppose though that's what these technical emails are for. > Originally at VA I thought to just use something like the Utah-GLX memory > manager for the freelist, or perhaps Keith's block memory manager, but just > to extend it. I'm not so sure that this is the proper solution though. > > I guess writing down some of the attributes we want is in order, and trying > to think through the problem: > 1. We want a method to find out if our pages were messed with, and this should > be fast and trivial. Hashing into a bit vector based on id seems like a > good thing here. I describe this in detail later in the email when I answer > one of your questions. > > 2. We want a fast method to find a free region if any exist. A free region > should be randomly selected or selected with a weight towards the page(s) > being close to an address range we specify. A queue, stack or list come to > mind here. Lists have such poor performance sometimes though.... 
Should > the lists be stored inside the data blocks? I don't think so, but it might > be an implementation that makes sense. I tend to think the "allocation" > structures could/should be separate from the pagelist. One quick comment here. We *cannot* store any of our memory-manager data in on-card memory. There are cards that store textures and / or vertex data in memory that is not accessible by the CPU. There are a couple of Sun chips like that, and I think one or two 3dlabs chips might be like that. Who knows what the next ATI, Matrox, or Intel chip might do. > Here could be some possible freelist implementations: > a. Each page is a bit in a bit vector. A set bit means the memory is in > use. Find the first zero bit and look for zero bits of a certain number > afterwards for a particular sized region. Would have really good > performance in the normal case where we are not over committed, and could be > made to index to a particular address space easily. There are drawbacks > here though, we could potentially use a lot of memory with all the bit > vectors in our memory manager. Also this doesn't help us too much in the > over committed case, we would need something to run through the pagelist and > swap out by age in a linear fashion when things get over committed. Or > perhaps in a random fashion, dunno. This could happen in a kernel thread or > in the Xserver at regular intervals though, so we always attempt to have > some room in the freelist. I don't think having a clean-up thread will be very helpful. When there is activity happening, it will likely be happening at a very high rate. I think that the thread would either sleep all the time (with nothing to do) or be continuously woken up on-demand. > b. We go with a list, queue, or stack. 6 bytes for 256 megs / 4kb pages for > a single link, or 8 bytes for 256 megs / 4 kb pages for a double link. 
We > make the head of the list at initialization point to the whole region and we > split much like the Utah implementation did. Could be LRU or MRU. Might be > faster than the previous method unless we want to weigh by address space, then > it might get more complicated. Unfortunately this is only kept sorted on > age, not where we are in the address space. It could have poor selection of > free region performance and things might tend to be grouped and have some > bad behavior. Perhaps we keep two separate lists, the allocated list and > the free list. > > c. Something a little more exotic might have better performance. Perhaps > keeping a binary tree as a front end to the region lists, that way allowing > us to select quickly based on address space. Perhaps slice up the address > space into big (say 4 MB) regions and have a list for each region. Perhaps > hashing based on region size. I suppose the possibilities are endless here. > While something like these ideas might work, I usually go back to the > drawing board when I end up with too exotic a solution. Simple and elegant > tends to work best in most situations. On their own, (a) and (b) are too simple. That won't give us enough functionality to do all the things we want. The memory bit-map has the advantage that we can pick the largest free block and try to reclaim memory around that free block until it is large enough to satisfy the request. After having tried to do that with the existing linked list approach, I can honestly say that a linked list is a very poor data structure for that purpose. On the flip side, the bit-map doesn't store much useful information about the allocated blocks. That makes it difficult to select which memory to reclaim when memory is very full. The problem is that BOTH of these situations are important performance cases! > 3. 
We want an easy method to grow the memory backing an agp > pool, but also > some sort of per client restriction, perhaps just the system wide > restrictions will do? This should be solved by the agp extension proposal I > made earlier in the week. What are the "system wide restrictions"? Available memory? > 4. We want a simple way to determine if an age allows us to do something to > a texture which has the BLOCK_CAN_BE_CLOBBERED bit set, storing the last age > used on a block is all that should be required I think. For textures and vertex buffers this should be true. [snip] >>Okay. There's a few details of this that I'm not seeing. I'm sure >>they're there, I'm just not seeing them. >> >>Process A needs to allocate some blocks (or even just a single block) >>for a texture. It scans the list of blocks and finds that not enough >>free blocks are available. It performs some hocus-pocus and determines >>that a block "owned" by process B needs to be freed. That block has the >>BLOCK_CAN_SWAP bit set, but the BLOCK_CAN_BE_CLOBBERED bit is cleared. >> >>Process A asks the kernel to page the block out. Then what? How does >>process B find out that its block was stolen and page it back in? > > Okay here is how I think things could happen: > > I want to page the block out, I request to the kernel to return when this > list of pages that I give you have been swapped out and are available. If > the kernel can immediately process this request, do it and return, if I have > to do some dma put the client on a wait queue to be woken up when it > happens. > > The kernel goes ahead and updates the blocks in the SAREA saying that they > aren't there (marking their id's as zero perhaps) > > Process B comes along and sees its textures aren't resident and needs them, > it asks the kernel to make them resident somewhere, it doesn't care where. > It passes some IDs to the kernel and asks the kernel to make them resident. 
> The kernel puts the process on a waitqueue or returns in a similar fashion > to the first request. Up to this point, I follow you. Here is my problem. Say we have 16KB blocks. Say process B had a single block that had 4 vertex buffers in. The first 3 are 1KB and the last one is 2KB (and hangs over into the next block). This first block is the one that process A selected to swap-out. Is the ID number assigned to the first block (the one that was swapped-out) and the second block (that wasn't swapped-out) the same? It seems like it should be (and that would enable the kernel to shuffle things around and keep blocks contiguous), but it seems like it would be difficult (or at least irritating) to keep all the block IDs correct (as subregions in the blocks are allocated and freed by a process). When process B goes to sleep waiting for its blocks to come back, what locks will it hold? If it doesn't hold any, how do we prevent process A from coming back and stealing the second block? > Whenever we get the lock with contention we must do some sort of quick > scanning. We might want to speed up the process somehow, perhaps some sort > of hashing by texture number to a dirty flag. Actually that is probably the > best implementation. If we reserve 64k of address space to be our dirty > flags (backed only when accessed) we can make the dirty flags a bit vector. > Considering the texture or block id as an index into this vector we can > rapidly find out if our list of textures has been "fooled" with. This > prevents us from scanning the entire list, which could be slow. That would be easy enough. When a process wants to issue rendering it would grab the lock, check the bits for each of the objects (textures, vertex buffers, etc.) it wants to use. It would test the bit. If the bit is set, that means "partially not here." The process would then check the blocks that actually map to the object it wants to use. 
In the memory layout above, if process B wants to render from one of the 1KB vertex buffers, it only has to make sure that the first block is paged in. If the second block is out, it doesn't matter. It then issues the rendering command, updates the fence, releases the lock. > I should also point out that the id's will be reused. We will always > attempt to use the smallest id available for use. This way using it as an > index into a shared memory area isn't so bad. That way we avoid using lots > of memory for nothing when we only have a few texture blocks. How would we dole out IDs? Would there be a kernel call to get / release a set of IDs? I feel like we're making some excellent progress here. Hurray for open-source! :) |
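[Editorial note: the freelist option (a) discussed above — a bit vector where a set bit means a page is in use, scanned for a run of zero bits — is simple enough to sketch. This is a hypothetical illustration, not DRI code; all names (`find_free_run`, `page_in_use`, `BITS_PER_WORD`) are invented for the example.]

```c
#include <stdint.h>

#define BITS_PER_WORD 32

/* Test whether page 'i' is marked in-use in the bit-vector freelist. */
static int page_in_use(const uint32_t *map, unsigned i)
{
    return (map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1;
}

/* Find the first run of 'n' free (zero) pages in a map of 'total' pages.
 * Returns the starting page index, or -1 if no such run exists -- the
 * over-committed case, where we fall back to reclaiming pages by age. */
int find_free_run(const uint32_t *map, unsigned total, unsigned n)
{
    unsigned start = 0, len = 0, i;

    for (i = 0; i < total; i++) {
        if (page_in_use(map, i)) {
            len = 0;
            start = i + 1;      /* run broken; restart after this page */
        } else if (++len == n) {
            return (int) start; /* found a big-enough free run */
        }
    }
    return -1;
}

/* Mark the pages of an allocated run as in-use. */
void mark_run(uint32_t *map, unsigned start, unsigned n)
{
    unsigned i;
    for (i = start; i < start + n; i++)
        map[i / BITS_PER_WORD] |= 1u << (i % BITS_PER_WORD);
}
```

This also shows the memory cost Jeff mentions: one bit per 4KB page, i.e. 1KB of vector per 32MB of managed memory.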
From: Jeff H. <jha...@ad...> - 2003-01-20 05:59:27
|
> > The potential problem is there are some things that can't be tracked by a > simple "age." The one thing I can think of is back-buffers. An > application might have several buffer-swap operations that are blocked > waiting for a certain vertical blank number. There could be other > rendering operations that are sent after the buffer-swap that will > complete BEFORE the blit for the buffer-swap is queued. I can't see a > reasonable way to assign an age to those back-buffers. > > Since this is the only case I can think of, there may be a different way > to handle it. Well then it looks like storing the type of memory is important and something we need to look at I guess. > > > 2. We consider the block or group of blocks as an entire > "unit", everything > is done on units, not individual pieces of the blocks. That > prevents people > swapping out the first page of a group of textures and someone having to > wait for just that block to come back. > > I believe that the block should be the unit used. If each block has a group > ID (the IDs that you talk about below) and a sequence number, we can do > some very nice optimizations. Imagine a case where we have two textures > that use 51% of the available texture space. Performance would DIE if > we had to bring in the entire texture every single time. We can do a > little optimization and only bring in 2% of texture memory each time > instead of 102%. Just a slight comment here, if the memory has actually made it out of the agp aperture, no matter how big the page is, the cost of getting it back is the same. Course I really like the idea of storing sequence numbers though, gives room for lots of flexibility. And for situations where we haven't fully made it out of the agp aperture, the cost of the blit is much smaller. > As we get a bit farther along we'll need to decide exactly what > information we want to store with each block to help make swap-out > decisions. > > We could let each process make that decision. 
With each block it > stores three values. The first value is the cost of restoring a block. > The second is the normalized probability that the block will be needed > during the current frame, and the third is the probability that it will be > needed in the next frame. Values of zero for the probability mean "I > don't know." The cost value could probably be inferred from the status > bits and the fullness value. > > With these values it becomes pretty simple for the kernel to select > candidate blocks to reclaim. Just a small clarification here, the decision should be done in user space, you tell the kernel to swap such and such block out. The kernel has no brains when it comes to memory managing decisions. The reasons for this are twofold: kernel code tends to not change as easily as user space code, and the whole initial reason of not wanting to do too much complex stuff in the kernel. > I don't think having a clean-up thread will be very helpful. When there > is activity happening, it will likely be happening at a very high rate. > I think that the thread would either sleep all the time (with nothing > to do) or be continuously woken up on-demand. After thinking about this a little more I think you're right, that was a bogus idea. > On their own, (a) and (b) are too simple. That won't give us enough > functionality to do all the things we want. The memory bit-map has the > advantage that we can pick the largest free block and try to reclaim > memory around that free block until it is large enough to satisfy the > request. After having tried to do that with the existing linked list > approach, I can honestly say that a linked list is a very poor > data structure for that purpose. > > On the flip side, the bit-map doesn't store much useful information > about the allocated blocks. That makes it difficult to select which > memory to reclaim when memory is very full. > > The problem is that BOTH of these situations are important performance > cases! 
Okay I have done some thinking, and I think I have a pretty good solution. Under normal cases we try and get a completely unused portion of memory by using the bit vector freelist. If one is unavailable and we can't grow agp memory for some reason we fall back onto another data structure, which is a priority heap (priority queue or binary heap depending on the data structures book you look at.) That should have good performance and makes more sense than a linked list I think, as far as size is concerned. We could implement a heap with just an array of index values into the pagelist. We might get some vampiric(sp?) performance issues from just about every operation on the heap being log n. However our heap should be of a small enough height so this won't make too much of a difference over a linked list. We should make this a min style heap, where age is the value used for the key in the heap. That way the top of the heap is always the texture block or group with the minimum age (longest since it was last used by the card). We want to avoid ever doing a search of the heap, so we store the index into the heap with the pagelist. When we rearrange something in the heap we make sure we update these indices. Perhaps we want to do a different selection than on longest since last use, but I think that might give us reasonable performance. I suppose MRU would be trivial if it's a max heap, so that sort of thing would be pretty easy to implement with this data structure as well. I know you get different answers depending on who you ask, but what's the replacement strategy you would recommend? Btw, in case there is any doubt we are storing only used blocks in the bit vector. Also by storing the index of where a block is in the heap, if we want to quickly go from the freelist bit vector to the position in the heap it's a very simple operation. > > > 3. 
We want an easy method to grow the memory backing an agp > pool, but also > some sort of per client restriction, perhaps just the system wide > restrictions will do? This should be solved by the agp > extension proposal I > made earlier in the week. > > What are the "system wide restrictions"? Available memory? Well there are only so many pages of memory that can be physically allocated. We need to limit how much a client will request so as not to cause performance of the entire system to go in the toilet. agpgart already has some limitations on what it will allow to be allocated and that limit might be enough. We might want to allow each client to only add up to 8MB to the aperture or something like that as well, but those sorts of restrictions need some thinking about. > > > 4. We want a simple way to determine if an age allows us to do > something to > > a texture which has the BLOCK_CAN_BE_CLOBBERED bit set, storing > the last age > > used on a block is all that should be required I think. > > For textures and vertex buffers this should be true. > > [snip] > > >>Okay. There's a few details of this that I'm not seeing. I'm sure > >>they're there, I'm just not seeing them. > >> > >>Process A needs to allocate some blocks (or even just a single block) > >>for a texture. It scans the list of blocks and finds that not enough > >>free blocks are available. It performs some hocus-pocus and determines > >>that a block "owned" by process B needs to be freed. That block has the > >>BLOCK_CAN_SWAP bit set, but the BLOCK_CAN_BE_CLOBBERED bit is cleared. > >> > >>Process A asks the kernel to page the block out. Then what? How does > >>process B find out that its block was stolen and page it back in? > > > > Okay here is how I think things could happen: > > > > I want to page the block out, I request to the kernel to return > when this > > list of pages that I give you have been swapped out and are > available. 
If > > the kernel can immediately process this request, do it and > return, if I have > > to do some dma put the client on a wait queue to be woken up when it > > happens. > > > > The kernel goes ahead and updates the blocks in the SAREA > saying that they > > aren't there (marking their id's as zero perhaps) > > > > Process B comes along an sees its textures aren't resident and > needs them, > > it asks the kernel to make them resident somewhere, it doesn't > care where. > > It passes some ID's to the kernel and asks the kernel to make > them resident. > > The kernel puts the process on a waitqueue or returns in a > similar fashion > > to the first request. > > Up to this point, I follow you. Here is my problem. Say we have 16KB > blocks. Say process B had a single block that had 4 vertex buffers in. > The first 3 are 1KB and the last one is 2KB (and hangs over into the > next block). This first block is the one that process A selected to > swap-out. > > Is the ID number assigned to the first block (the one that was > swapped-out) and the second block (that wasn't swapped-out) the same? > It seems like it should be (and that would enable the kernel to shuffle > things around and keep blocks contiguous), but it seems like it would be > difficult (or at least irritating) to keep all the block IDs correct (as > subregions in the blocks are allocated and freed by a process). > > When process B goes to sleep waiting for its blocks to come back, what > locks will it hold? If it doesn't hold any, how do we prevent process A > from coming back and stealing the second block? We don't need to hold any locks when we sleep, but since we have to ask the kernel for the block back it can decide who to wake up and who to put to sleep. We can just wake in a FIFO fashion from the wait queue. We can also perhaps make a simple decision in the kernel. If it sees that there is alot of contention on one area then it could "recommend" another area in some fashion to put its textures. 
Here could be an easy way to accomplish this I think: We have an ioctl which pages in a block id, not just a single page. It writes back to user space the page number where the block set now lives. That way if the kernel sees A LOT of contention for an area of memory it might try to put it somewhere else. However perhaps we could do something like this in user space too...... Actually now that I think about it, the less the kernel does to mess up user space the better. I still like the whole idea of managing everything by the block id though, and keeping sequence numbers in the blocks in some fashion. Perhaps that whole idea of letting the kernel recommend putting something somewhere else is completely bogus. I was tempted to completely erase it, but I leave it in this email in case it is good fodder for discussion. > > > Whenever we get the lock with contention we must do some sort of quick > > scanning. We might want to speed up the process somehow, > perhaps some sort > > of hashing by texture number to a dirty flag. Actually that is > probably the > > best implementation. If we reserve 64k of address space to be our dirty > > flags (backed only when accessed) we can make the dirty flags a > bit vector. > > Considering the texture or block id as an index into this vector we can > > rapidly find out if our list of textures has been "fooled" with. This > > prevents us from scanning the entire list, which could be slow. > > That would be easy enough. When a process wants to issue rendering it > would grab the lock, check the bits for each of the objects (textures, > vertex buffers, etc.) it wants to use. It would test the bit. If the > bit is set, that means "partially not here." The process would then > check the blocks that actually map to the object it wants to use. In > the memory layout above, if process B wants to render from one of the > 1KB vertex buffers, it only has to make sure that the first block is > paged in. 
If the second block is out, it doesn't matter. It then > issues the rendering command, updates the fence, releases the lock. > > > I should also point out that the id's will be reused. We will always > > attempt to use the smallest id available for use. This way > using it as an > > index into a shared memory area isn't so bad. That way we > avoid using lots > > of memory for nothing when we only have a few texture blocks. > > How would we dole out IDs? Would there be a kernel call to get / > release a set of IDs? Probably, however there are methods to do this without kernel intervention. The good ole' bit vector is great for doling out keys. ;) That's how agpgart handles its keys actually. Oh, btw, did I mention how much I like bit vectors? ;) Some little bit of wisdom about everything looking like a nail when you walk around with a hammer immediately comes to mind. ;) -Jeff |
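[Editorial note: the min-heap Jeff describes earlier in this email — keyed on age, with each pagelist entry remembering its own position in the heap so no search is ever needed — can be sketched roughly as below. This is a hypothetical illustration; the names (`struct block`, `heap_insert`, `MAX_BLOCKS`) are invented, and a real implementation would live alongside the pagelist.]

```c
#include <stddef.h>

#define MAX_BLOCKS 1024

struct block {
    unsigned age;    /* last age/fence that touched this block */
    int heap_idx;    /* back-index: this block's slot in the heap */
};

struct age_heap {
    int idx[MAX_BLOCKS]; /* heap of indices into the pagelist */
    int count;
};

static void heap_swap(struct age_heap *h, struct block *blocks, int a, int b)
{
    int tmp = h->idx[a];
    h->idx[a] = h->idx[b];
    h->idx[b] = tmp;
    /* Keep the back-indices up to date, so we can jump from a block
     * straight to its heap slot without ever searching the heap. */
    blocks[h->idx[a]].heap_idx = a;
    blocks[h->idx[b]].heap_idx = b;
}

void heap_insert(struct age_heap *h, struct block *blocks, int blk)
{
    int i = h->count++;
    h->idx[i] = blk;
    blocks[blk].heap_idx = i;
    /* Sift up: min-heap on age, so the root is always the block with
     * the oldest (smallest) age -- the LRU candidate to swap out. */
    while (i > 0 && blocks[h->idx[i]].age < blocks[h->idx[(i - 1) / 2]].age) {
        heap_swap(h, blocks, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

/* Peek at the LRU block without removing it; -1 if the heap is empty. */
int heap_min(const struct age_heap *h)
{
    return h->count > 0 ? h->idx[0] : -1;
}
```

Every heap operation is O(log n), but as Jeff notes the heap is shallow enough that this should not matter in practice.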
From: Ian R. <id...@us...> - 2003-01-20 18:31:24
|
Jeff Hartmann wrote: >>>2. We consider the block or group of blocks as an entire "unit", everything >>>is done on units, not individual pieces of the blocks. That prevents people >>>swapping out the first page of a group of textures and someone having to >>>wait for just that block to come back. >> >>I believe that the block should be the unit used. If each block has a group >>ID (the IDs that you talk about below) and a sequence number, we can do >>some very nice optimizations. Imagine a case where we have two textures >>that use 51% of the available texture space. Performance would DIE if >>we had to bring in the entire texture every single time. We can do a >>little optimization and only bring in 2% of texture memory each time >>instead of 102%. > > Just a slight comment here, if the memory has actually made it out of the > agp aperture no matter how big the page is the cost of getting it back is > the same. Course I really like the idea of storing sequence numbers though, > gives room for lots of flexibility. And for situations where we haven't > fully made it out of the agp aperture, the cost of the blit is much smaller. That's not quite what I meant. Imagine the user has 40MB of on-card memory available for textures. If that user runs an application that uses two 2048x2048x32bpp textures per-frame (each one weighing in at ~22MB), we will either have to use AGP memory for one of them, or (if the card is PCI with no PCIGART) it will have to copy 44MB across the PCI bus every single frame. In reality, to fit the second texture in on-card memory, we only have to reclaim 4MB. We could then hit a steady state where 18MB of each texture never moves out of on-card memory, and only 8MB of texture needs to be copied in each frame. If we view a texture (or any object) as an ordered sequence of blocks instead of a monolithic lump, we can make that optimization. Based on what you have written below, I think we're in agreement that this is the way to go. 
:) [big snip] >>One their own (a) and (b) are too simple. That won't give us enough >>functionality to do all the things we want. The memory bit-map has the >>advantage that we can pick the largest free block and try to reclaim >>memory around that free block until it is large enough to satisfy the >>request. After having tried to do that with the existing linked list >>approach, I can honestly say that a linked list is a very poor >>datastructure for that purpose. >> >>On the flip side, the bit-map doesn't store much useful information >>about the allocated blocks. That makes it difficult to select which >>memory to reclaim when memory is very full. >> >>The problem is that BOTH of these situations are important performance >>cases! > > > Okay I have done some thinking, and I think I have a pretty good solution. > Under normal cases we try and get a completely unused portion of memory by > using the bit vector freelist. If one is unavailable and we can't grow agp > memory for some reason we fall back onto another data structure, which is a > priority heap (priority queue or binary heap depending on the data > structures book you look at.) > That should have good performance and makes more sense then a linked list I > think, as far as size is concerned. We could implement a heap with just an > array of index values into the pagelist. We might get some vampiric(sp?) > performance issues from just about every operation on the heap being log n. > However our heap should be of a small enough height so this won't make too > much a difference over a linked list. We should make this a min style heap, > where age is the value used for the key in the heap. That way the top of > the heap is always the texture block or group with the minimum age (longest > since it was last used by the card). We want to avoid ever doing a search > of the heap, so we store the index into the heap with the pagelist. When we > rearrange something in the heap we make sure we update these indices. 
> Perhaps we want to do a different selection than on longest since last use, > but I think that might give us reasonable performance. I suppose MRU would > be trivial if it's a max heap, so that sort of thing would be pretty easy to > implement with this data structure as well. I know you get different > answers depending on who you ask, but what's the replacement strategy you > would recommend? It would take some experiments to prove it, but I believe that simple LRU or MRU is always suboptimal in both the theoretical and practical case. One example where any type of priority queue fails is where you need one more block than is available in the largest free region. If the largest free region is 54 blocks and 55 blocks are needed, the optimal solution is most likely to reclaim one of the used blocks at the head or tail of the free region. I think we'd also prefer to reclaim blocks that don't need to be swapped (i.e., have the throw-away bit set). There are some cases, depending on the number of blocks needed, where we'd also prefer to reclaim blocks that aren't part of a sequence. We may well settle on using a slight variation of simple LRU or MRU. I'd like to see our initial implementation allow a little more flexibility to experiment with gathering different heuristics to improve performance. [another big snip] >>Up to this point, I follow you. Here is my problem. Say we have 16KB >>blocks. Say process B had a single block that had 4 vertex buffers in. >> The first 3 are 1KB and the last one is 2KB (and hangs over into the >>next block). This first block is the one that process A selected to >>swap-out. >> >>Is the ID number assigned to the first block (the one that was >>swapped-out) and the second block (that wasn't swapped-out) the same? 
>>It seems like it should be (and that would enable the kernel to shuffle >>things around and keep blocks contiguous), but it seems like it would be >>difficult (or at least irritating) to keep all the block IDs correct (as >>subregions in the blocks are allocated and freed by a process). >> >>When process B goes to sleep waiting for its blocks to come back, what >>locks will it hold? If it doesn't hold any, how do we prevent process A >>from coming back and stealing the second block? > > > We don't need to hold any locks when we sleep, but since we have to ask the > kernel for the block back, it can decide who to wake up and who to put to > sleep. We can just wake in a FIFO fashion from the wait queue. We can also > perhaps make a simple decision in the kernel. If it sees that there is a lot > of contention on one area, then it could "recommend" another area in some > fashion to put its textures in. I got to thinking about this, and I came up with another solution. When process B is going to reclaim blocks from process A, process B marks the reclaimed blocks with its block ID and clears the can-swap bit. It also clears the can-swap bit on other blocks (that it already has) that are part of the sequence. When process B calls into the kernel, it passes in the block number and the old block ID & sequence information. This prevents process A from getting the CPU and trying to take the blocks back. [snip] >>How would we dole out IDs? Would there be a kernel call to get / >>release a set of IDs? > > Probably, however there are methods to do this without kernel intervention. > The good ole' bit vector is great for doling out keys. ;) Oh. Duh. :) |
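Jeff's min-heap idea above is straightforward to sketch. What follows is a minimal user-space illustration, not actual DRI code; all the names (`pagelist`, `heap_touch`, and so on) are invented for this sketch. The key trick is the back-pointer stored with each pagelist entry, which lets any block's heap position be found in O(1), so re-keying a block after an age update never requires a search of the heap:

```c
#define MAX_BLOCKS 64

/* Illustrative only; these fields are not from any real DRI header. */
struct block {
    unsigned age;      /* last time the card used this block */
    int      heap_idx; /* back-pointer into the heap, so no search is needed */
};

static struct block pagelist[MAX_BLOCKS];
static int heap[MAX_BLOCKS]; /* heap of pagelist indices, min age at heap[0] */
static int heap_len;

static void heap_place(int pos, int blk)
{
    heap[pos] = blk;
    pagelist[blk].heap_idx = pos; /* keep the back-pointer current */
}

static void sift_up(int pos)
{
    while (pos > 0) {
        int parent = (pos - 1) / 2;
        if (pagelist[heap[parent]].age <= pagelist[heap[pos]].age)
            break;
        int tmp = heap[parent];
        heap_place(parent, heap[pos]);
        heap_place(pos, tmp);
        pos = parent;
    }
}

static void sift_down(int pos)
{
    for (;;) {
        int l = 2 * pos + 1, r = l + 1, min = pos;
        if (l < heap_len && pagelist[heap[l]].age < pagelist[heap[min]].age)
            min = l;
        if (r < heap_len && pagelist[heap[r]].age < pagelist[heap[min]].age)
            min = r;
        if (min == pos)
            break;
        int tmp = heap[min];
        heap_place(min, heap[pos]);
        heap_place(pos, tmp);
        pos = min;
    }
}

void heap_insert(int blk)              /* O(log n) */
{
    heap_place(heap_len++, blk);
    sift_up(heap_len - 1);
}

int heap_pop_oldest(void)              /* O(log n): block longest unused */
{
    int blk = heap[0];
    heap_place(0, heap[--heap_len]);
    sift_down(0);
    return blk;
}

void heap_touch(int blk, unsigned age) /* the card just used blk; O(log n) */
{
    pagelist[blk].age = age;
    sift_down(pagelist[blk].heap_idx); /* ages only grow, so only sift down */
}
```

Because a block's age only ever increases, re-keying after a touch only ever moves the entry downward in a min-heap, and eviction is simply popping the root, the block least recently used by the card.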
From: Jens O. <je...@tu...> - 2003-01-17 17:38:03
|
Ian, I had a chance to read your ideas on memory management last night. First off, I'd like to thank you for doing a very good job of collecting requirements and then separating out your ideas for implementation. This level of discipline really helps me understand where you are constrained by requirements vs. where you are exploring solutions. As you address the very complex issue of virtualizing graphics subsystem resources, I'm going to attempt to influence your thinking to include the concept of a 3D desktop compositing engine. You've made references to capabilities that Apple is supporting, yet to me the ultimate challenge that the Apple desktop paradigm provides today is the 3D and compositing effects they are doing with Genie-bottle window iconification and multilevel window transparency. Starting to address these capabilities in open source will put additional requirements on resource management. Ian Romanick wrote: > What follows is the collected requirements for the new DRI memory > manager. This list is the product of several discussions between Brian, > Keith, Allen, and myself several months ago. After the list, I have > included some of my thoughts on the big picture that I see from these > requirements. > > 1. Single-copy textures > > Right now each texture exists in two or three places. There is a copy > in on-card or AGP memory, in system memory (managed by the driver), and > in application memory. Any solution should be able to eliminate one or > two of those copies. > > If the driver-tracked copy in system memory is eliminated, care must be > taken when the texture needs to be removed from on-card / AGP memory. > Additionally, changes to the texture image made via glCopyTexImage must > not be lost. > > It may be possible to eliminate one copy of the texture using > APPLE_client_storage. A portion of this could be done purely in Mesa. 
> If the user supplied image matches the internal format of the texture, > then the driver can use the application's copy of the texture in place > of the driver's copy. > > Modulo implementation difficulties, it may even be possible to use the > pages that hold the texture as backing store for a portion of the AGP > aperture. This is the only way to truly achieve single-copy textures. > The implementation may prove too difficult on existing x86 systems to be > worth the effort. This functionality is available in MacOS 10.1, so the > same difficulties may not exist on Linux PPC. Are the AGP aperture issues present for any AGP page swapping, or just for assigning new, random virtual memory pages? I was under the impression that preallocated AGP memory could be swapped in and out on the x86 platform. In other words, it would be difficult to dynamically map a user texture into the AGP aperture, but we could create a pool of AGP memory that was larger than the aperture and use the APPLE_client_storage extension to allocate space from that pool to the application. > 2. Share texture memory among multiple OpenGL contexts > > Texture memory is currently shared by all OpenGL contexts. That is, > when an OpenGL context switch happens it is not necessary to reload all > textures. The texture manager needs to continue to use a paged memory > model (as opposed to a segmented memory model). > > 3. Accommodate other OpenGL buffers > > The allocator should also be used for allocating vertex buffers, render > targets (pbuffers, back-buffers, depth-buffers, etc.), and other > buffers. This can be useful beyond supporting SGIX_pbuffer, > ARB_vertex_array_objects, and optimized display lists. Dynamically > allocating per-context depth and back-buffers will allow multiple Z > depths to be used at a time (i.e., 16-bit depth-buffer for one window and > 24-bit depth-buffer for another) and super-sampling FSAA. 
For traditional 2D window systems, this requirement is sufficient in that you don't need to be able to truly provide an unlimited amount of private buffer space... rather, when you run out of space, you can fall back to a method where memory is allocated from a single large buffer based on visible display pixels. That said, a 3D compositing window system couldn't fall back on this method. Imagine N transparent windows all stacked on top of each other, each needing dedicated display resources in order to yield the correct final display results. Virtualizing an infinite number of color and alpha layers may not be possible in hardware alone, but software compositing can be prohibitively slow. Perhaps providing a large dedicated amount of resources to 3D compositing and virtualizing all non-visible resources could provide a reasonable solution. This implies that back buffers, depth buffers, pbuffers and super-sampled buffers all need to be potentially swapped out when the rendering context is swapped out. > 4. Support texture pseudo-render targets > > Accelerating some OpenGL functions, such as glCopyTexImage, > SGIS_generate_mipmaps, and ARB_render_texture, may require special > support and consideration. > > 5. Additional AGP related issues > > There may be cases where textures need to be moved back-and-forth > between AGP and on-card memory. For example, a texture might reside in > AGP memory, and an operation may be requested that requires that the > texture be in on-card memory. > > 6. Additional texture formats and layouts > > Compressed, 1D, 3D, cube map, and non-power-of-two textures need to be > supported in addition to "traditional" 2D power-of-two textures. > > 7. Allen Akin's pinned-texture proposal > > If we ever expose memory management to the user (beyond texture > priorities) we want to be sure our allocator is designed with this in mind. > > 8. 
Device independence > > As much as possible, the source code for the memory manager should live > somewhere device independent. This is both for the benefit of newly > developed drivers and for maintaining existing drivers. > > * My Thoughts * > > There are really only two radical departures from the existing memory > manager. The first is using the memory manager for non-texture memory > objects. The second, which is partially a result of the first, is the > need to "pin" objects. It would not do to have one context kick another > context's depth-buffer out of memory! Why not swap out another context's depth buffer? If it's not being used at the time, is that any worse than swapping out textures that are actively being used by the yielding context? > My initial thought on how to accomplish this was to move the allocator > into the kernel. There would be a low-level allocator that could be > used for non-texture buffers and a way to create textures (from data). > In the texture case, the kernel would only allocate memory when a > texture was used. Instead of using the actual texture address in > drawing command streams, the user-level driver would insert texture IDs. > The kernel would use these IDs to map to real texture addresses. > > The benefit is that all memory management would be handled by a single > omniscient execution context (the kernel). The downside is that it > would move a LOT of code into the kernel. It would be almost entirely > OS and device independent, but there would likely be a lot of it. > > After talking with Jeff Hartmann in IRC on 1/13, I started thinking > about all of this again. Jeff had some serious reservations about > moving that volume of code into the kernel, and he believed that all of > the requirements could be met by a purely user-space implementation. > After thinking about things some more, I'm starting to agree. 
> > What follows is a fairly random series of thoughts on how a user-space > memory manager could be made to work. > > I believe that everything could be done by breaking each memory space > down into blocks (as is currently done) and tracking two values, either > implicitly or explicitly, with each block. The first value is some sort > of swap-out priority. This is currently implicitly tracked by the list > ordering in the SAREA. The other value is basically a semaphore, but it > could be implemented as a simple can-swap bit. > > Blocks that have an active depth-buffer would never have can-swap set. > Blocks that have "normal" textures, back-buffers, render-target textures, > and pbuffers would have their can-swap bit conditionally set. Each of > these types of blocks would have the can-swap bit cleared under the > following situations: > > - Normal textures - While a rendering operation is queued that > will use the texture. > - SGIS_generate_mipmaps textures - While the blits are in progress > to create the filtered mipmaps. > - glCopyTexImage textures - While the blit to copy image data to > the texture is in progress and while the data in the texture has > not been copied to some sort of backing store. > - pbuffers - While rendering operations to the pbuffer are in > progress. pbuffers have a mechanism to tell an application when > the contents of the pbuffer have been "lost." This could be > exploited by the memory manager. One caveat is when a pbuffer > is bound to a texture (ARB_render_texture). While the pbuffer > is bound to a texture, its contents cannot be lost. Can the > contents be "swapped out" to some sort of backing store, like > with glCopyTexImage targets? There is another caveat for PBuffers that Allen brought to my attention a few years ago. The way they are currently defined, it's possible for the application to request a PBuffer that cannot be "destroyed", but rather must be swapped out and then restored later. 
> - Back-buffers - In unextended GLX, back-buffers can never be > swapped. However, if OML_sync_control is available, a "double > buffered" visual may want to have many virtual back-buffers. > Each time glXSwapBuffersMscOML (essentially an asynchronous > glXSwapBuffers call) is made, a new back-buffer is allocated as > the rendering target. Once a back-buffer is copied to the > front-buffer (i.e., the queued buffer-swap completes), the > back-buffer can be swapped out. > > There may be other situations where can-swap is cleared, but that's all > I could think of. Similar rules would exist for vertex buffers (for > ARB_vertex_array_object, EXT_compiled_vertex_array, optimized display > lists, etc.). > > Only a single bit per block is needed in the SAREA. That bit is the > union of the bits for each object that is part of that block. This > union must be calculated by the user-space driver. This presents a > possible problem of user-space clients failing to update the can-swap > bits for some reason (process hung on a blocking IO call?). The current > implementation avoids this problem by forcing all blocks to be swappable > at all times. > > At this point I'm left with a few questions. > > 1. In a scheme like this, how could processes be forced to update the > can-swap bits on blocks that they own? > 2. What is the best way for processes to be notified of events that > could cause can-swap bits to change (i.e., rendering completion, > asynchronous buffer-swap completion, etc.)? Signals from the kernel? > Polling "age" variables? > 3. If some sort of signal based notification is used, could it be used > to implement NV_fence and / or APPLE_fence? > 4. How could the memory manager handle objects that span multiple > blocks? In other words, could the memory manager be made to prefer > to swap out blocks that wholly contain all of the objects that > overlap the block? Are there other useful metrics? 
Prefer to > swap out blocks that are half full over blocks that are completely > full? > 5. What other things have I missed that might prevent this system > from working? :) I hope I'm bringing in some food for thought... and not unnecessarily complicating an already difficult and important DRI improvement. -- /\ Jens Owen / \/\ _ je...@tu... / \ \ \ Steamboat Springs, Colorado |
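As a concrete, and entirely hypothetical, illustration of the per-block union Ian describes, the SAREA bit for a block could be computed as follows. The type and state names here are invented for the sketch and do not come from any real DRI header:

```c
/* Hypothetical bookkeeping for the can-swap scheme.  Each object that
 * overlaps a block records why (if at all) it is currently pinned. */
enum obj_state {
    OBJ_IDLE      = 0,      /* nothing pins this object             */
    OBJ_QUEUED    = 1 << 0, /* a queued rendering op will use it    */
    OBJ_NO_BACKUP = 1 << 1, /* contents not yet saved anywhere else */
};

struct mem_object {
    enum obj_state state;
};

/* The block's single SAREA bit is the union over every object that is
 * part of the block: the block may be swapped only if *all* of its
 * objects allow it.  The user-space driver would recompute this each
 * time any object's state changes. */
int block_can_swap(const struct mem_object *objs, int nobjs)
{
    for (int i = 0; i < nobjs; i++)
        if (objs[i].state != OBJ_IDLE)
            return 0; /* one pinned object pins the whole block */
    return 1;
}
```

The sketch makes the failure mode Ian raises visible: if a client hangs without clearing its objects back to `OBJ_IDLE`, the block stays unswappable until some external mechanism forces an update.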
From: Ian R. <id...@us...> - 2003-01-17 19:03:36
|
Jens Owen wrote: > Ian, > > I had a chance to read your ideas on memory management last night. First > off, I'd like to thank you for doing a very good job of collecting > requirements and then separating out your ideas for implementation. This > level of discipline really helps me understand where you are constrained > by requirements vs. where you are exploring solutions. > > As you address the very complex issue of virtualizing graphics subsystem > resources, I'm going to attempt to influence your thinking to include > the concept of a 3D desktop compositing engine. You've made references > to capabilities that Apple is supporting, yet to me the ultimate > challenge that the Apple desktop paradigm provides today is the 3D and > compositing effects they are doing with Genie-bottle window iconification > and multilevel window transparency. Starting to address these > capabilities in open source will put additional requirements on > resource management. > > Ian Romanick wrote: > >> What follows is the collected requirements for the new DRI memory >> manager. This list is the product of several discussions between >> Brian, Keith, Allen, and myself several months ago. After the list, I >> have included some of my thoughts on the big picture that I see from >> these requirements. >> >> 1. Single-copy textures >> >> Right now each texture exists in two or three places. There is a copy >> in on-card or AGP memory, in system memory (managed by the driver), >> and in application memory. Any solution should be able to eliminate >> one or two of those copies. >> >> If the driver-tracked copy in system memory is eliminated, care must >> be taken when the texture needs to be removed from on-card / AGP >> memory. Additionally, changes to the texture image made via >> glCopyTexImage must not be lost. >> >> It may be possible to eliminate one copy of the texture using >> APPLE_client_storage. A portion of this could be done purely in Mesa. 
>> If the user supplied image matches the internal format of the texture, >> then the driver can use the application's copy of the texture in place >> of the driver's copy. >> >> Modulo implementation difficulties, it may even be possible to use the >> pages that hold the texture as backing store for a portion of the AGP >> aperture. This is the only way to truly achieve single-copy textures. >> The implementation may prove too difficult on existing x86 systems to >> be worth the effort. This functionality is available in MacOS 10.1, >> so the same difficulties may not exist on Linux PPC. > > Are the AGP aperture issues present for any AGP page swapping, or just > for assigning new, random virtual memory pages? I was under the > impression that preallocated AGP memory could be swapped in and out on > the x86 platform. In other words, it would be difficult to dynamically > map a user texture into the AGP aperture, but we could create a pool of > AGP memory that was larger than the aperture and use the > APPLE_client_storage extension to allocate space from that pool to the > application. AFAIK, your assumptions about AGP mappings are correct. Jeff would be the one to ask, though. :) APPLE_client_storage isn't really an allocator. It allows applications to tell the GL that it can keep and use pointers to application storage (i.e., pointers passed into TexImage2D). The optimization that can be done on MacOS is to not only use those kept pointers as the backing store for textures, but also remap those pages into AGP space to be directly used by the graphics hardware. Having multiple physical pages to back AGP pages could be useful for ARB_vertex_array_objects, so I'll keep that usage in mind. [snip] >> 3. Accommodate other OpenGL buffers >> >> The allocator should also be used for allocating vertex buffers, >> render targets (pbuffers, back-buffers, depth-buffers, etc.), and >> other buffers. 
This can be useful beyond supporting SGIX_pbuffer, >> ARB_vertex_array_objects, and optimized display lists. Dynamically >> allocating per-context depth and back-buffers will allow multiple Z >> depths to be used at a time (i.e., 16-bit depth-buffer for one window and >> 24-bit depth-buffer for another) and super-sampling FSAA. > > For traditional 2D window systems, this requirement is sufficient in > that you don't need to be able to truly provide an unlimited amount of > private buffer space... rather, when you run out of space, you can fall > back to a method where memory is allocated from a single large buffer > based on visible display pixels. That is to say, fall back to the current static back / depth buffer allocation system. That was something that I had considered, but didn't explicitly say. For some set of active OpenGL contexts, a kernel memory manager could decide that it was more memory efficient to fall back to the single, full-screen back / depth buffer system. > That said, a 3D compositing window system couldn't fall back on this > method. Imagine N transparent windows all stacked on top of each other > and each needing dedicated display resources in order to yield the > correct final display results. Virtualizing an infinite number of color > and alpha layers may not be possible in hardware alone, but software > compositing can be prohibitively slow. Perhaps providing a large > dedicated amount of resources to 3D compositing and virtualizing all > non-visible resources could provide a reasonable solution. This implies > that back buffers, depth buffers, pbuffers and super-sampled buffers all > need to be potentially swapped out when the rendering context is swapped > out. Ugh. THAT is a difficult problem. [snip] >> * My Thoughts * >> >> There are really only two radical departures from the existing memory >> manager. The first is using the memory manager for non-texture memory >> objects. 
The second, which is partially a result of the first, is the >> need to "pin" objects. It would not do to have one context kick >> another context's depth-buffer out of memory! > > > Why not swap out another context's depth buffer? If it's not being used > at the time, is that any worse than swapping out textures that are > actively being used by the yielding context? With an in-kernel memory manager this would be possible. If everything is running in user-space, process B would have to copy process A's back-buffer into process A's private address space so that process A could restore it later. This is one of the issues that complicates swapping out render-target textures (i.e., glCopyTexImage targets). When process B needs to swap out process A's texture, it can't do it until process A has copied the modified texture data out of texture memory so that it can be restored later. [snip] >> - pbuffers - While rendering operations to the pbuffer are in >> progress. pbuffers have a mechanism to tell an application when >> the contents of the pbuffer have been "lost." This could be >> exploited by the memory manager. One caveat is when a pbuffer >> is bound to a texture (ARB_render_texture). While the pbuffer >> is bound to a texture, its contents cannot be lost. Can the >> contents be "swapped out" to some sort of backing store, like >> with glCopyTexImage targets? > > There is another caveat for PBuffers that Allen brought to my attention > a few years ago. The way they are currently defined, it's possible for > the application to request a PBuffer that cannot be "destroyed", but > rather must be swapped out and then restored later. That is correct. I had forgotten about that. :) |
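The reclaim handshake discussed earlier in the thread (process B stamping its ID on the blocks and clearing can-swap before calling into the kernel) might look roughly like the sketch below. The structures, field names, and the ioctl boundary are all hypothetical, invented here for illustration:

```c
/* Hypothetical per-block descriptor shared between contexts. */
struct block_desc {
    int owner_id; /* ID of the context that currently owns the block */
    int can_swap; /* cleared while a reclaim is in flight            */
    int sequence; /* groups blocks belonging to one multi-block object */
};

/* What process B would hand to the kernel after marking the block. */
struct reclaim_req {
    int block;
    int old_owner; /* so the kernel knows whom to notify or wake */
    int sequence;
};

/* Process B reclaims 'block' from its current owner.  Stamping our ID
 * and clearing can-swap *before* the kernel call is the point of the
 * scheme: process A cannot get the CPU and take the block back in the
 * window between the marking and the kernel's bookkeeping. */
struct reclaim_req reclaim_block(struct block_desc *blocks, int block,
                                 int my_id)
{
    struct reclaim_req req;

    req.block     = block;
    req.old_owner = blocks[block].owner_id;
    req.sequence  = blocks[block].sequence;

    blocks[block].owner_id = my_id; /* stamp our block ID...        */
    blocks[block].can_swap = 0;     /* ...and pin it against A      */

    /* req would now be passed to the kernel (ioctl not shown). */
    return req;
}
```

A real implementation would also clear can-swap on the other blocks of the same sequence, as Ian describes, before making the kernel call.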
From: Dieter <Die...@ha...> - 2003-01-17 19:22:38
|
Am Freitag, 17. Januar 2003 18:37 schrieb Jens Owen: > Ian, > > I had a chance to read your ideas on memory management last night. First > off, I'd like to thank you for doing a very good job of collecting > requirements and then separating out your ideas for implementation. > This level of discipline really helps me understand where you are > constrained by requirements vs. where you are exploring solutions. > > As you address the very complex issue of virtualizing graphics subsystem > resources, I'm going to attempt to influence your thinking to include > the concept of a 3D desktop compositing engine. You've made references > to capabilities that Apple is supporting, yet to me the ultimate > challenge that the Apple desktop paradigm provides today is the 3D and > compositing effects they are doing with Genie-bottle window iconification > and multilevel window transparency. Starting to address these > capabilities in open source will put additional requirements on > resource management. Does this all "fit" with a "video editing" system? Something like an integration of the GATOS project (video in/out/DVI/TV) so that we can base video cutting systems upon Linux/*BSD? Thanks, Dieter |
From: Jens O. <je...@tu...> - 2003-01-17 21:12:45
|
Dieter Nützel wrote: > Am Freitag, 17. Januar 2003 18:37 schrieb Jens Owen: > >>Ian, >> >>I had a chance to read your ideas on memory management last night. First >>off, I'd like to thank you for doing a very good job of collecting >>requirements and then separating out your ideas for implementation. >>This level of discipline really helps me understand where you are >>constrained by requirements vs. where you are exploring solutions. >> >>As you address the very complex issue of virtualizing graphics subsystem >>resources, I'm going to attempt to influence your thinking to include >>the concept of a 3D desktop compositing engine. You've made references >>to capabilities that Apple is supporting, yet to me the ultimate >>challenge that the Apple desktop paradigm provides today is the 3D and >>compositing effects they are doing with Genie-bottle window iconification >>and multilevel window transparency. Starting to address these >>capabilities in open source will put additional requirements on >>resource management. > > Does this all "fit" with a "video editing" system? > Something like an integration of the GATOS project (video in/out/DVI/TV) so that we > can base video cutting systems upon Linux/*BSD? Not directly. The video streaming capabilities would tax the use of rendering contexts and backbuffers similarly to an active 3D application. I'm referring more to how the desktop's compositing engine would generate special effects while still supporting the load of active 3D, 2D and video rendering contexts. -- /\ Jens Owen / \/\ _ je...@tu... / \ \ \ Steamboat Springs, Colorado |
From: Sven L. <lu...@dp...> - 2003-01-18 08:35:19
|
On Thu, Jan 16, 2003 at 05:33:42PM -0800, Ian Romanick wrote: > What follows is the collected requirements for the new DRI memory > manager. This list is the product of several discussions between Brian, > Keith, Allen, and myself several months ago. After the list, I have > included some of my thoughts on the big picture that I see from these > requirements. > > 1. Single-copy textures > > Right now each texture exists in two or three places. There is a copy > in on-card or AGP memory, in system memory (managed by the driver), and > in application memory. Any solution should be able to eliminate one or > two of those copies. ... BTW, since you are looking into this, have you thought about graphics chips that can do MMU-like tricks? I am not sure if the current set of graphics chips the DRI runs on do this kind of stuff, but they well may in the future. I know the gamma drm module uses the gamma's virtual memory table to avoid doing virtual<->physical conversion. But more importantly to you, although there is not yet a DRI driver for it, the 3Dlabs Permedia3 can use virtual memory for its textures. That is, you can basically set up the graphics board's memory as a cache, and have the MMU-like unit swap memory pages in from host memory using, I suppose, its own page-replacement algorithm. Friendly, Sven Luther |
From: Ian R. <id...@us...> - 2003-01-20 17:31:00
|
Sven Luther wrote: > On Thu, Jan 16, 2003 at 05:33:42PM -0800, Ian Romanick wrote: > >>1. Single-copy textures >> >>Right now each texture exists in two or three places. There is a copy >>in on-card or AGP memory, in system memory (managed by the driver), and >>in application memory. Any solution should be able to eliminate one or >>two of those copies. > > ... > > BTW, since you are looking into this, have you thought about graphics > chips that can do MMU-like tricks? I am not sure if the current set of > graphics chips the DRI runs on do this kind of stuff, but they well may > in the future. I know the gamma drm module uses the gamma's virtual > memory table to avoid doing virtual<->physical conversion. But more > importantly to you, although there is not yet a DRI driver for it, the > 3Dlabs Permedia3 can use virtual memory for its textures. That is, you > can basically set up the graphics board's memory as a cache, and have > the MMU-like unit swap memory pages in from host memory using, I > suppose, its own page-replacement algorithm. The only chips that I know of that support this technology are the various recent 3Dlabs chips. They have a number of patents on this technology, and, AFAIK, they have no intention of licensing it to anyone for all the tea in China. I agree that it is a good idea to keep virtual textures in mind, but, since we don't have any hardware documentation for it, it will be difficult to do more than that. |