From: Keith W. <ke...@tu...> - 2002-06-17 19:15:16
|
OK, so I'm working on the r200 kernel interfaces & I'm kind of at a crossroads. What I'd like to do looks like this, and is influenced to a great degree by the r200 sample implementation, but also by my latent desires for the drm...

- Scrap the existing dma buffer system entirely.

- Provide an allocator for truly private agp & framebuffer memory. This can be used to house textures, commands (ie dma), backbuffers, depthbuffers, display lists, etc. When allocation is required and fails, the client falls back to software rendering.

- Have the client emit totally native command streams. Provide two modes for getting these to hardware (see the sketch after this message):
  - Checked: the kernel module picks apart the stream and verifies it is secure. For best results the stream is emitted to cached memory.
  - Fast: the kernel module just schedules the commands as an indirect buffer on the ring. (Must be in agp or fb memory...)

- Provide a timestamp mechanism (we already have one of these, but it could be a lot better) so that the client can age buffers on its own.

--> Together, these make it very easy to implement NV_vertex_array_range, while at the same time simplifying the kernel module hugely.

Some issues crop up:

Backwards compatibility. The r200 has the same 2d core as the radeon, and the existing radeon ddx driver works fine with the existing radeon.o kernel module. Should the r200 have a new kernel module with this functionality, or add to the existing one? All this new stuff would work fine with the radeon, so once the r200 is done, the radeon could move to these mechanisms for free, if they share a kernel module.

Furthermore, the existing 2d ddx code is written against the existing radeon.o kernel mechanisms, including dma buffer allocation. A new r200.o kernel module would have to duplicate that functionality, or I would have to rewrite the ddx code to use either the radeon.o or r200.o modules according to which was loaded -- this sounds ugly...

On the other hand, if I keep a single module, I have to keep all the old radeon crud in the new r200 module -- and not all of it will even work with the r200. Additionally, I wonder what the point of cleaning up interfaces is if I have to keep all the old ones around too.

Thoughts, anyone?

Keith |
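For concreteness, a minimal sketch of what such a two-mode submit interface might look like from user space. Everything here is hypothetical illustration -- the struct, ioctl numbers, and names are invented for the example, not an actual radeon/r200 interface:

    /* Hypothetical two-mode command-stream submit interface (sketch). */
    #include <sys/ioctl.h>

    typedef struct {
            unsigned int offset;   /* start of the native command stream */
            unsigned int size;     /* length in bytes */
            int discard;           /* kernel may reclaim the region when done */
    } r200_cmdbuf_t;

    /* Checked: kernel parses the stream and rejects illegal register
     * writes before scheduling it (stream may live in cached memory). */
    #define DRM_IOCTL_R200_CMD_CHECKED  _IOW('d', 0x40, r200_cmdbuf_t)

    /* Fast: kernel schedules the stream directly as an indirect buffer
     * on the ring, unverified (stream must be in agp or fb memory). */
    #define DRM_IOCTL_R200_CMD_FAST     _IOW('d', 0x41, r200_cmdbuf_t)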
From: Linus T. <tor...@tr...> - 2002-06-17 20:06:20
|
[ Damn, I wish I knew more about the R200 internals. As it is, I don't have the hw view ]

On Mon, 17 Jun 2002, Keith Whitwell wrote:
> - Provide an allocator for truly private agp & framebuffer memory. This can be used to house textures, commands (ie dma), backbuffers, depthbuffers, display lists, etc. When allocation is required and fails, the client falls back to software rendering.

Are you talking about client-side sw rendering, or true indirect rendering with GLX?

My suggestion is to always fall back on _indirect_ rendering when there is any problem at all, whether due to allocation errors or not.

> - Have the client emit totally native command streams. Provide two modes for getting these to hardware:
>   - Checked: the kernel module picks apart the stream and verifies it is secure. For best results the stream is emitted to cached memory.

I do not believe you can make "checked" mode work. Any indirection supported by the hardware makes checking really, really painful. But more importantly, I bet just about any hardware out there will have common commands that can be used to do bad things like making the engine render all over the X font caches or whatever.

So you get into a situation where you not only need to check the commands themselves, but you need to check the validity of the arguments to the commands etc. Total nightmare city, not to mention that because this is all in kernel space it's also nasty to debug.

And even if it were to work on some hardware (ie the hardware itself does enough sanity checking of all the commands and arguments that you can set up a "secure" client mode), it certainly won't do that right now on all 3D hardware. So you would be unable to have a "common code setup", and you'd have to work with different 3D drivers having totally different approaches to this "safe" thing.

That doesn't sound like a good architecture to me.

>   - Fast: the kernel module just schedules the commands as an indirect buffer on the ring. (Must be in agp or fb memory...)

I believe this is reasonable, and acceptable as a "hey, we need to do it this way to get the highest possible performance".

Now, the fact that I think that "checked" mode is a bad idea doesn't mean that this would be the _only_ mode. No, I suggest that you have _one_ rule (see above): any problem means that we use server-side indirect rendering with GLX.

And then we trust the X server, since we have to trust that one for everything else _anyway_.

In short:
- the kernel always accepts the raw stream directly
- the "safe" mode comes from the fact that the author of the raw stream is the (trusted) X server, not the (untrusted) client.
- the "fast" mode is nothing but a short-circuit of the stream generation.

Advantages:
- the kernel doesn't even _know_ what's up, and does only the simple stuff.
- You don't have multiple different levels of rendering. You only have one, and the question is just who does it.

When done right, the X server hw rendering would share the same engine and the same codebase as the direct app rendering code does. Put another way: Utah-GLX with DRI. Both are right. Both have advantages. Try to just mix the advantages the right way.

Kill software rendering on the client side. It's worth doing client-side rendering only if it improves performance noticeably, and that is obviously NOT TRUE unless the client-side renderer is so hw-accelerated that context switches are a major problem.

Linus |
From: Keith W. <ke...@tu...> - 2002-06-17 20:20:44
|
Linus Torvalds wrote:
> [ Damn, I wish I knew more about the R200 internals. As it is, I don't have the hw view ]
>
> Are you talking about client-side sw rendering, or true indirect rendering with GLX?

Client-side software rendering at this point.

> My suggestion is to always fall back on _indirect_ rendering when there is any problem at all, whether due to allocation errors or not.

Transitioning between the two is a difficult task due to the large amount of state maintained in the client-side context. Falling back to client-side software rendering is trivial, otoh.

> I do not believe you can make "checked" mode work. Any indirection supported by the hardware makes checking really, really painful. But more importantly, I bet just about any hardware out there will have common commands that can be used to do bad things like making the engine render all over the X font caches or whatever.

I just intend to do the sort of checking that the current radeon driver does, but do it on native streams rather than 'precooked' ones. Indirection (of commands) is one thing I would outlaw. Basically on the radeon, it boils down to checking that the registers being updated fall within certain ranges.

> So you get into a situation where you not only need to check the commands themselves, but you need to check the validity of the arguments to the commands etc. Total nightmare city, not to mention that because this is all in kernel space it's also nasty to debug.
>
> And even if it were to work on some hardware (ie the hardware itself does enough sanity checking of all the commands and arguments that you can set up a "secure" client mode), it certainly won't do that right now on all 3D hardware. So you would be unable to have a "common code setup", and you'd have to work with different 3D drivers having totally different approaches to this "safe" thing.

This is really about the r200 at this point; I'm not proposing a general hw-independent mechanism. However, there are commonalities. In most hardware the bulk of the data can bypass checking, as it is known to be treated by the hardware as vertices. Bad vertices can scribble on the framebuffer, etc, but they can't compromise security. It's the state updates, typically in a separate command stream, that are problematic & require checking.

> That doesn't sound like a good architecture to me.

Well, I think the checking can be done pretty simply: check register numbers against good/bad ranges, probably using a bit field as there aren't that many registers. (A sketch of that check follows this message.)

>   - Fast: the kernel module just schedules the commands as an indirect buffer on the ring. (Must be in agp or fb memory...)
>
> I believe this is reasonable, and acceptable as a "hey, we need to do it this way to get the highest possible performance".
>
> Now, the fact that I think that "checked" mode is a bad idea doesn't mean that this would be the _only_ mode. No, I suggest that you have _one_ rule (see above): any problem means that we use server-side indirect rendering with GLX.
>
> And then we trust the X server, since we have to trust that one for everything else _anyway_.
>
> In short:
> - the kernel always accepts the raw stream directly
> - the "safe" mode comes from the fact that the author of the raw stream is the (trusted) X server, not the (untrusted) client.
> - the "fast" mode is nothing but a short-circuit of the stream generation.
>
> Advantages:
> - the kernel doesn't even _know_ what's up, and does only the simple stuff.
> - You don't have multiple different levels of rendering. You only have one, and the question is just who does it.
>
> When done right, the X server hw rendering would share the same engine and the same codebase as the direct app rendering code does. Put another way: Utah-GLX with DRI. Both are right. Both have advantages. Try to just mix the advantages the right way.
>
> Kill software rendering on the client side. It's worth doing client-side rendering only if it improves performance noticeably, and that is obviously NOT TRUE unless the client-side renderer is so hw-accelerated that context switches are a major problem.

Client software rendering is pretty much impossible to get rid of - there are always things that can't be done with hardware, and as most GL state is orthogonal to most other GL state, you need a fully featured software rasterizer hanging around to catch the fallback cases.

Keith |
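A minimal sketch of the bit-field register check Keith describes. The register count and names are assumptions for illustration, not the actual radeon/r200 checker:

    /* One bit per register: set if an untrusted client may write it. */
    #define NR_REGS 4096                        /* assumed register space */

    static unsigned char allowed[NR_REGS / 8];  /* 512-byte bit field */

    static void allow_range(unsigned int first, unsigned int last)
    {
            unsigned int r;
            for (r = first; r <= last; r++)
                    allowed[r >> 3] |= 1 << (r & 7);
    }

    /* Returns 0 iff a packet writing 'count' registers starting at
     * 'reg' touches only permitted registers. */
    static int check_packet(unsigned int reg, unsigned int count)
    {
            unsigned int r;

            if (count > NR_REGS || reg > NR_REGS - count)
                    return -1;                  /* out of range (no overflow) */
            for (r = reg; r < reg + count; r++)
                    if (!(allowed[r >> 3] & (1 << (r & 7))))
                            return -1;          /* forbidden register */
            return 0;
    }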
From: Linus T. <tor...@tr...> - 2002-06-17 20:29:22
|
On Mon, 17 Jun 2002, Keith Whitwell wrote:
> Client software rendering is pretty much impossible to get rid of - there are always things that can't be done with hardware, and as most GL state is orthogonal to most other GL state, you need a fully featured software rasterizer hanging around to catch the fallback cases.

I realize that "fallback for the cases that the hw cannot handle" obviously happens in sw.

I'm just saying that the client should _never_ fall back on "we can't use the hw, so we should just render everything in sw". If we cannot use hw assist (because of security reasons or anything else), we should fall back to indirect GLX instead of sw rendering.

It may be that the X server _can_ use hw assist even when a client can not, and doing server-side rendering can be a whole lot faster in that case.

In particular, this is definitely true for the security case. It may be that the R200 is checkable, but other chips definitely aren't.

I see your point about switching mid-stream from one to the other being nasty, but why would you ever do that?

Linus |
From: Keith W. <ke...@tu...> - 2002-06-17 20:39:08
|
Linus Torvalds wrote:
> On Mon, 17 Jun 2002, Keith Whitwell wrote:
>> Client software rendering is pretty much impossible to get rid of - there are always things that can't be done with hardware, and as most GL state is orthogonal to most other GL state, you need a fully featured software rasterizer hanging around to catch the fallback cases.
>
> I realize that "fallback for the cases that the hw cannot handle" obviously happens in sw.
>
> I'm just saying that the client should _never_ fall back on "we can't use the hw, so we should just render everything in sw". If we cannot use hw assist (because of security reasons or anything else), we should fall back to indirect GLX instead of sw rendering.
>
> It may be that the X server _can_ use hw assist even when a client can not, and doing server-side rendering can be a whole lot faster in that case.
>
> In particular, this is definitely true for the security case. It may be that the R200 is checkable, but other chips definitely aren't.
>
> I see your point about switching mid-stream from one to the other being nasty, but why would you ever do that?

OK, I misunderstood you.

The changes I proposed included an allocator for private backbuffers. We currently have a single shared backbuffer used with the same cliprects as the front buffer. Allocation of a backbuffer in this scheme can never fail, and no special actions are required when window size changes.

With private backbuffers, allocation can fail - both at the creation of the window and when the window is resized. I was suggesting falling back to software rather than coming up with some elaborate way of transitioning between private and shared backbuffers.

The same goes for texture allocation, etc. Previously textures could be kicked out by other clients trying to get theirs in. (This thrashing can be very slow, even slower than sw rendering.) There are good upsides to this -- if we know textures in the framebuffer are there permanently, we can accelerate operations that were previously fallbacks.

So yes, a failure to allocate initial buffers might as well send things over to the X server via indirect rendering. Failure to allocate back/depth buffers mid-operation means a fallback until a later realloc succeeds. Failure to alloc a texture buffer means falling back whenever that texture is bound, etc. This is a different approach from what we have now.

Keith |
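A toy sketch of the per-resource fallback policy described above, with malloc() standing in for the (hypothetical) private agp/framebuffer allocator and all names invented for illustration:

    #include <stdlib.h>

    struct surface {
            void *mem;              /* private agp/fb storage, or NULL */
            size_t size;
    };

    struct context {
            struct surface back, depth;
            int sw_fallback;        /* render in software until realloc works */
    };

    /* Called at window creation and after each resize: retry failed
     * allocations, and fall back only while one is still missing. */
    static void validate_buffers(struct context *ctx)
    {
            if (!ctx->back.mem)
                    ctx->back.mem = malloc(ctx->back.size);   /* stand-in */
            if (!ctx->depth.mem)
                    ctx->depth.mem = malloc(ctx->depth.size); /* stand-in */
            ctx->sw_fallback = !ctx->back.mem || !ctx->depth.mem;
    }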
From: Allen A. <ak...@po...> - 2002-06-17 21:03:02
|
On Mon, Jun 17, 2002 at 01:29:36PM -0700, Linus Torvalds wrote:
| I realize that "fallback for the cases that the hw cannot handle"
| obviously happens in sw.
|
| I'm just saying that the client should _never_ fall back on "we can't use
| the hw, so we should just render everything in sw". If we cannot use hw
| assist (because of security reasons or anything else), we should fall back
| to indirect GLX instead of sw rendering.
|
| It may be that the X server _can_ use hw assist even when a client can
| not, and doing server-side rendering can be a whole lot faster in that
| case.

There are at least two issues.

One is that if the X server is forced to do software rendering (because the client uses indirect GLX rendering as a fallback instead of performing the software rendering itself), interactivity will definitely suffer. I'm not sure this is what you had in mind, but just in case, I thought I'd mention it.

Two is that OpenGL semantics make it *very* expensive to move rendering responsibility from the client to the X server or back, even if software rendering isn't involved. The GLX spec goes into this at some length (in the discussion of what GLX calls "address spaces"). But a quick explanation is that heavyweight objects like textures and display lists reside in the same memory "region" as the rest of the OpenGL state for a rendering context. This region is in the client for direct rendering, and in the X server for indirect rendering. Moving the responsibility for rendering from client to server forces all the heavyweight objects to be copied from client to server, so that isn't a feasible rendering fallback strategy even if the server never uses software rendering.

We can discuss the reasons for the GLX semantics if anyone is interested. The short summary is that they're intended to minimize data duplication and data transfer costs, and eliminate the need for locking concurrent accesses to objects that aren't intended to be shared.

Allen |
From: Linus T. <tor...@tr...> - 2002-06-17 21:18:08
|
On Mon, 17 Jun 2002, Allen Akin wrote:
> Two is that OpenGL semantics make it *very* expensive to move rendering responsibility from the client to the X server or back

Note that this was definitely not part of the plan as far as I was concerned.

It's insanity trying to move rendering around, regardless of any OpenGL or GLX implementation details - it would just force people to be ridiculously careful about how to maintain state in a "movable" manner. That's just crazy, please don't think I ever meant that.

I was hoping that once you started hw rendering, you had no reason to ever stop. That seems to be true today, but not in the world Keith envisions..

Linus |
From: Allen A. <ak...@po...> - 2002-06-17 21:25:12
|
On Mon, Jun 17, 2002 at 02:18:35PM -0700, Linus Torvalds wrote:
| On Mon, 17 Jun 2002, Allen Akin wrote:
| > Two is that OpenGL semantics make it *very* expensive to move rendering
| > responsibility from the client to the X server or back
|
| Note that this was definitely not part of the plan as far as I was
| concerned.

Good!

| I was hoping that once you started hw rendering, you had no reason to ever
| stop. That seems to be true today, but not in the world Keith envisions..

For what it's worth, in my experience software fallback after starting hardware rendering is quite common today. Games are an exception, because the developers go out of their way to avoid situations that might cause a fallback.

Allen |
From: Keith W. <ke...@tu...> - 2002-06-17 20:31:45
|
>> - Fast: the kernel module just schedules the commands as an indirect buffer on the ring. (Must be in agp or fb memory...)
>
> I believe this is reasonable, and acceptable as a "hey, we need to do it this way to get the highest possible performance".
>
> Now, the fact that I think that "checked" mode is a bad idea doesn't mean that this would be the _only_ mode. No, I suggest that you have _one_ rule (see above): any problem means that we use server-side indirect rendering with GLX.
>
> And then we trust the X server, since we have to trust that one for everything else _anyway_.
>
> In short:
> - the kernel always accepts the raw stream directly
> - the "safe" mode comes from the fact that the author of the raw stream is the (trusted) X server, not the (untrusted) client.
> - the "fast" mode is nothing but a short-circuit of the stream generation.

I actually like this scheme too, though I don't really accept your arguments against "checked" mode. I doubt many people would actually end up using a "checked" mode once they know they could strip off the kimono & get unsafe...

A slow (ie indirect rendering) safe mode might be a good way of forcing people to make a tradeoff that we've been trying to hide.

Keith |
From: Linus T. <tor...@tr...> - 2002-06-17 20:43:24
|
On Mon, 17 Jun 2002, Keith Whitwell wrote:
> A slow (ie indirect rendering) safe mode might be a good way of forcing people to make a tradeoff that we've been trying to hide.

I guess I just don't believe indirect rendering has to be slow.

_software_ rendering is slow. But task-switching to the X server to render is not necessarily a bad thing. Task-switching has been extremely well optimized, and we can do a task-switch in 1.5 usec on the kind of hardware I have. That's actually not all that much slower than a system call (1 us).

Unlike a system call, a task switch obviously has to happen _twice_ to get back to the original program, so on the hardware I just tested that's actually 3 us vs 1 us, and a task-switch also has to re-instate the TLB contents, so it slows down with working set size. So it's not quite that easy to compare.

But you can do an awful lot of task switches per second, AND you can take advantage of SMP by letting one CPU handle the IO wait issues. So at least in theory there is nothing that says that indirect hw-accelerated rendering couldn't be quite comparable.

The biggest hit to indirect rendering is likely to be the data copy, not the context switch: I don't know if the GLX protocol supports putting things into shared memory areas (ie a GLX + MIT-SHM combination).

Linus |
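Numbers like the ones above can be estimated with a classic pipe ping-pong between two processes: each round trip forces at least two task switches. A rough sketch only; careful measurement needs warm-up, CPU affinity, and accounting for the read/write syscall cost itself:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
            int p1[2], p2[2], i, n = 100000;
            char c = 0;
            struct timeval t0, t1;

            if (pipe(p1) || pipe(p2))
                    return 1;
            if (fork() == 0) {
                    for (;;) {              /* child: echo one byte forever */
                            if (read(p1[0], &c, 1) != 1)
                                    _exit(0);
                            write(p2[1], &c, 1);
                    }
            }
            gettimeofday(&t0, NULL);
            for (i = 0; i < n; i++) {       /* parent: ping-pong n times */
                    write(p1[1], &c, 1);
                    read(p2[0], &c, 1);
            }
            gettimeofday(&t1, NULL);
            printf("%.2f us per round trip (>= 2 task switches)\n",
                   ((t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_usec - t0.tv_usec)) / n);
            return 0;
    }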
From: Keith W. <ke...@tu...> - 2002-06-17 20:57:44
|
Linus Torvalds wrote:
> I guess I just don't believe indirect rendering has to be slow.
>
> _software_ rendering is slow. But task-switching to the X server to render is not necessarily a bad thing. Task-switching has been extremely well optimized, and we can do a task-switch in 1.5 usec on the kind of hardware I have. That's actually not all that much slower than a system call (1 us).
>
> Unlike a system call, a task switch obviously has to happen _twice_ to get back to the original program, so on the hardware I just tested that's actually 3 us vs 1 us, and a task-switch also has to re-instate the TLB contents, so it slows down with working set size. So it's not quite that easy to compare.
>
> But you can do an awful lot of task switches per second, AND you can take advantage of SMP by letting one CPU handle the IO wait issues. So at least in theory there is nothing that says that indirect hw-accelerated rendering couldn't be quite comparable.
>
> The biggest hit to indirect rendering is likely to be the data copy, not the context switch: I don't know if the GLX protocol supports putting things into shared memory areas (ie a GLX + MIT-SHM combination).

There are a couple of glx-specific reasons why indirect rendering is slow. Firstly, the protocol is just the dumbest encoding of the incoming client calls & wastes a large percentage of the stream. Secondly, a lot of the GL extensions don't end up getting GLX protocol allocated for them - windows is king... Thirdly, GL has ended up with some mechanisms (like vertex arrays) that are specified in a way (unbounded size, no good update semantics) that they lose all their performance qualities over indirect links.

There is a kind of hope on the horizon in the shape of Chromium, but I don't know if anyone is seriously considering integrating this into the X server or glx.

But despite this, yes, a hw-accelerated indirect glx would be a lot quicker than the current sw implementation.

Keith |
From: F. <j_r...@ya...> - 2002-06-17 22:10:35
|
On 2002.06.17 21:57 Keith Whitwell wrote:
> Linus Torvalds wrote:
>> ...
>> The biggest hit to indirect rendering is likely to be the data copy, not the context switch: I don't know if the GLX protocol supports putting things into shared memory areas (ie a GLX + MIT-SHM combination).

Since Linus and Jens first touched on this, I have always thought that such a scheme - where the OpenGL state and rendering live in the X server and the bulk of the communication with the server is done through shared memory - is a much more versatile approach. It handles security efficiently while simplifying the communication between the OpenGL driver and the card, since the 3D driver is in a trusted entity - the X server itself.

> There are a couple of glx-specific reasons why indirect rendering is slow. Firstly, the protocol is just the dumbest encoding of the incoming client calls & wastes a large percentage of the stream. Secondly, a lot of the GL extensions don't end up getting GLX protocol allocated for them - windows is king... Thirdly, GL has ended up with some mechanisms (like vertex arrays) that are specified in a way (unbounded size, no good update semantics) that they lose all their performance qualities over indirect links.
>
> ...

We could overcome the GLX difficulties in the same way we do now in libGL with the direct rendering.

But I still don't understand why vertex arrays would be such a problem over shared memory. Aren't they basically just read and transformed into Mesa's vertex buffers? Couldn't the OpenGL drivers just read these vertex arrays directly out of the client memory space from the X process?

José Fonseca |
From: Keith W. <ke...@tu...> - 2002-06-17 22:20:01
|
> We could overcome the GLX difficulties in the same way we do now in libGL with the direct rendering.
>
> But I still don't understand why vertex arrays would be such a problem over shared memory. Aren't they basically just read and transformed into Mesa's vertex buffers? Couldn't the OpenGL drivers just read these vertex arrays directly out of the client memory space from the X process?

There's no indication of the 'top' of the vertex buffer, so you don't know how much to transfer. There's no semantics to tell you whether the vertex buffer contents have changed, so you don't know how often to transfer.

CVA fixes these problems to some extent. NV_vertex_array_range goes one better and lets the user put the vertex data straight into AGP buffers. Note that vertex data is always trustworthy, so for tcl hardware, you might get good performance if the vertex data can go directly to an AGP buffer.

Keith |
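To make the contrast concrete, a hedged sketch of the two usage styles, with entry points taken from the GL core and the NV_vertex_array_range extension specs (assuming headers that expose the NV entry points; error handling and the actual vertex writes are omitted):

    #include <GL/gl.h>
    #include <GL/glx.h>

    /* Classic vertex arrays: the pointer carries no bound, and nothing
     * says whether the data changed since the last draw -- the driver
     * only discovers what matters at glDrawArrays() time. */
    void classic_arrays(const GLfloat *verts, GLsizei count)
    {
            glEnableClientState(GL_VERTEX_ARRAY);
            glVertexPointer(3, GL_FLOAT, 0, verts);
            glDrawArrays(GL_TRIANGLES, 0, count);
    }

    /* NV_vertex_array_range: the app declares a bounded region up front
     * (here allocated in AGP space via the GLX allocator from the spec),
     * so the driver can let the hardware read vertices with no copy. */
    void var_arrays(GLsizei bytes)
    {
            void *agp = glXAllocateMemoryNV(bytes, 0.0f, 0.0f, 0.75f);
            glVertexArrayRangeNV(bytes, agp);
            glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
            /* ... fill agp with vertices, point arrays into it, draw ... */
    }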
From: F. <j_r...@ya...> - 2002-06-17 23:08:57
|
On 2002.06.17 23:19 Keith Whitwell wrote:
> There's no indication of the 'top' of the vertex buffer, so you don't know how much to transfer. There's no semantics to tell you whether the vertex buffer contents have changed, so you don't know how often to transfer.

But why even transfer in the first place? Why not simply map parts of the vertex buffers into the X memory space as they are needed, or is there some impossibility in the Linux architecture that prevents that?

> CVA fixes these problems to some extent. NV_vertex_array_range goes one better and lets the user put the vertex data straight into AGP buffers. Note that vertex data is always trustworthy, so for tcl hardware, you might get good performance if the vertex data can go directly to an AGP buffer.

Trustworthy data is always easier to deal with, but the problem is that you can't have the OpenGL state stored both in the client and in X space, and you probably need it to generate the TCL vertex data. Having the state on the X side would completely solve the security problem.

Of course my view is somewhat biased because of the Mach64 example, where the vertex data isn't trustworthy - the buffers are actually command buffers to the card, and there is no guarantee that malicious commands aren't issued. But on the other hand, on any card there is always some untrustworthy data issued from the OpenGL driver, and with the current trend of increased complexity and programmability that will probably get worse.

José Fonseca |
From: Ian R. <id...@us...> - 2002-06-24 17:08:52
|
On Tue, Jun 18, 2002 at 12:09:02AM +0100, José Fonseca wrote:
> On 2002.06.17 23:19 Keith Whitwell wrote:
> >> We could overcome the GLX difficulties in the same way we do now in libGL with the direct rendering.
> >>
> >> But I still don't understand why vertex arrays would be such a problem over shared memory. Aren't they basically just read and transformed into Mesa's vertex buffers? Couldn't the OpenGL drivers just read these vertex arrays directly out of the client memory space from the X process?
> >
> > There's no indication of the 'top' of the vertex buffer, so you don't know how much to transfer. There's no semantics to tell you whether the vertex buffer contents have changed, so you don't know how often to transfer.
>
> But why even transfer in the first place? Why not simply map parts of the vertex buffers into the X memory space as they are needed, or is there some impossibility in the Linux architecture that prevents that?

This is an old message, but I didn't see a reply to this point. The reason is that the indirect rendering path they've been talking about is the *same* one used by remote clients. A client running on a different box can't directly map anything, so the indirect clients on the same box (as the X server) have to follow the same rules.

--
Tell that to the Marines! |
From: F. <j_r...@ya...> - 2002-06-29 21:24:03
|
On Mon, Jun 24, 2002 at 10:08:42AM -0700, Ian Romanick wrote:
> On Tue, Jun 18, 2002 at 12:09:02AM +0100, José Fonseca wrote:
>> But why even transfer in the first place? Why not simply map parts of the vertex buffers into the X memory space as they are needed, or is there some impossibility in the Linux architecture that prevents that?
>
> This is an old message, but I didn't see a reply to this point. The reason is that the indirect rendering path they've been talking about is the *same* one used by remote clients.

I know it's the same path as for remote clients...

> A client running on a different box can't directly map anything, so the indirect clients on the same box (as the X server) have to follow the same rules.

Not really. That's what extensions like MIT-SHM exist for.

Anyway, all this is very academic until someone really starts doing something about it and - as Jens said before - there is no funding for that. That's why I had already planned to do something myself in the future (somewhere in the next year): initially just integrate the Mesa drivers into X/GLcore and depart from there.

José Fonseca |
From: Jens O. <je...@tu...> - 2002-06-17 21:09:14
|
Keith Whitwell wrote:
> OK, so I'm working on the r200 kernel interfaces & I'm kind of at a crossroads.
>
> What I'd like to do looks like this, and is influenced to a great degree by the r200 sample implementation, but also by my latent desires for the drm...
>
> - Scrap the existing dma buffer system entirely.
>
> - Provide an allocator for truly private agp & framebuffer memory. This can be used to house textures, commands (ie dma), backbuffers, depthbuffers, display lists, etc. When allocation is required and fails, the client falls back to software rendering.

Is there any way to "swap out" AGP pages (or a subset of pages) from the GART, and replace them with additional pages to delay (perhaps indefinitely) the need for a SW fallback? Perhaps changing the GART table on a per-context basis, so each context has a 64M maximum (or whatever the chipset supports), but different contexts can have a different set of 64M pages.

> - Have the client emit totally native command streams. Provide two modes for getting these to hardware:
>   - Checked: the kernel module picks apart the stream and verifies it is secure. For best results the stream is emitted to cached memory.

In your reply to Linus you mentioned this is the same as the current mechanism. Does the current Radeon driver use cached memory for the primary DMA commands? Won't your native stream need to be copied to AGP memory? Do you expect to validate every byte of the stream, or can you read a smaller subset of the buffer to validate?

>   - Fast: the kernel module just schedules the commands as an indirect buffer on the ring. (Must be in agp or fb memory...)

I like! Can we take this one step further and put non-array commands in the primary ring and protect the ring with the HW lock?

> - Provide a timestamp mechanism (we already have one of these, but it could be a lot better) so that the client can age buffers on its own.

This sounds useful regardless of what approach you take. What kind of benefits do you see coming from an improved client-managed buffer aging mechanism?

> --> Together, these make it very easy to implement NV_vertex_array_range, while at the same time simplifying the kernel module hugely.
>
> Some issues crop up:
>
> Backwards compatibility. The r200 has the same 2d core as the radeon, and the existing radeon ddx driver works fine with the existing radeon.o kernel module. Should the r200 have a new kernel module with this functionality, or add to the existing one? All this new stuff would work fine with the radeon, so once the r200 is done, the radeon could move to these mechanisms for free, if they share a kernel module.
>
> Furthermore, the existing 2d ddx code is written against the existing radeon.o kernel mechanisms, including dma buffer allocation. A new r200.o kernel module would have to duplicate that functionality, or I would have to rewrite the ddx code to use either the radeon.o or r200.o modules according to which was loaded -- this sounds ugly...
>
> On the other hand, if I keep a single module, I have to keep all the old radeon crud in the new r200 module -- and not all of it will even work with the r200. Additionally, I wonder what the point of cleaning up interfaces is if I have to keep all the old ones around too.
>
> Thoughts, anyone?

I would suggest the most complete solution for forward progress AND backward compatibility is providing a single radeon kernel module that supports the OLD *and* NEW interfaces. Then move *all* user space drivers forward to the NEW interface. Of course, newer user space drivers wouldn't work on older kernels, but this has never been a requirement... just a nice feature. (A sketch of what such dual-interface dispatch might look like follows this message.)

I realize keeping the OLD interface intact is counterproductive to cleaning up the code. Take comfort in the idea that you can remove the OLD interface later on, when something else forces a break in backwards compatibility.

If you do not have the time (or desire) to move the combined 2D driver to the NEW interface, then you will be forced to abandon the NEW interface or split the 2D driver into two different personalities (or drivers); I don't see any way around that if you hope to remove the OLD interface someday.

--
/\ Jens Owen / \/\ _ je...@tu... / \ \ \ Steamboat Springs, Colorado |
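A minimal sketch of the "one module, two interfaces" idea, with all names hypothetical (this is not the actual DRM dispatch code): the module keeps both ioctl tables, and each client selects an interface version once, so OLD user space keeps working while NEW user space gets the clean interface.

    #include <errno.h>

    struct radeon_file {
            int ifver;      /* interface chosen by this client: 1=OLD, 2=NEW */
    };

    /* Hypothetical per-interface handlers. */
    extern int radeon_old_ioctl(struct radeon_file *f, unsigned int cmd, void *arg);
    extern int radeon_new_ioctl(struct radeon_file *f, unsigned int cmd, void *arg);

    static int radeon_ioctl(struct radeon_file *f, unsigned int cmd, void *arg)
    {
            switch (f->ifver) {
            case 1:  return radeon_old_ioctl(f, cmd, arg);  /* dma buffers etc */
            case 2:  return radeon_new_ioctl(f, cmd, arg);  /* native streams */
            default: return -EINVAL;                        /* not negotiated */
            }
    }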
From: Keith W. <ke...@tu...> - 2002-06-17 21:30:07
|
Jens Owen wrote:
> Is there any way to "swap out" AGP pages (or a subset of pages) from the GART, and replace them with additional pages to delay (perhaps indefinitely) the need for a SW fallback? Perhaps changing the GART table on a per-context basis, so each context has a 64M maximum (or whatever the chipset supports), but different contexts can have a different set of 64M pages.

Perhaps, but I think sw rendering is probably no slower.

> In your reply to Linus you mentioned this is the same as the current mechanism. Does the current Radeon driver use cached memory for the primary DMA commands?

Yes

> Won't your native stream need to be copied to AGP memory?

Yes

> Do you expect to validate every byte of the stream, or can you read a smaller subset of the buffer to validate?

A bit smaller, but not significantly.

> I like! Can we take this one step further and put non-array commands in the primary ring and protect the ring with the HW lock?

One purpose of the indirect buffer mechanism is to reduce the locking overhead on the shared resource (the ring). The hardware processes indirect buffers as quickly as it does the ring. Using indirect buffers also means we don't have to worry as often about context switches.

> This sounds useful regardless of what approach you take. What kind of benefits do you see coming from an improved client-managed buffer aging mechanism?

NV_vertex_array_range, as I say below. Simplification of our client -- we really have to work around fixed size dma buffers at the moment.

> I would suggest the most complete solution for forward progress AND backward compatibility is providing a single radeon kernel module that supports the OLD *and* NEW interfaces. Then move *all* user space drivers forward to the NEW interface.
>
> If you do not have the time (or desire) to move the combined 2D driver to the NEW interface, then you will be forced to abandon the NEW interface or split the 2D driver into two different personalities (or drivers); I don't see any way around that if you hope to remove the OLD interface someday.

Yep.

Keith |
From: Keith W. <ke...@tu...> - 2002-06-17 21:37:10
|
Keith Whitwell wrote:
> Jens Owen wrote:
>> Is there any way to "swap out" AGP pages (or a subset of pages) from the GART, and replace them with additional pages to delay (perhaps indefinitely) the need for a SW fallback? Perhaps changing the GART table on a per-context basis, so each context has a 64M maximum (or whatever the chipset supports), but different contexts can have a different set of 64M pages.
>
> Perhaps, but I think sw rendering is probably no slower.

Actually this might not be true, but I don't know if I want to do the work to find out.

Keith |
From: Jens O. <je...@tu...> - 2002-06-17 21:41:26
|
Keith Whitwell wrote:
> Keith Whitwell wrote:
>> Jens Owen wrote:
>>> Is there any way to "swap out" AGP pages (or a subset of pages) from the GART, and replace them with additional pages to delay (perhaps indefinitely) the need for a SW fallback?
>>
>> Perhaps, but I think sw rendering is probably no slower.
>
> Actually this might not be true, but I don't know if I want to do the work to find out.

Sounds like a good task for a strong kernel developer. Any volunteers?

--
/\ Jens Owen / \/\ _ je...@tu... / \ \ \ Steamboat Springs, Colorado |
From: Linus T. <tor...@tr...> - 2002-06-17 23:28:20
|
Keith,
I've got a silly question for you..

Why do you need a kernel driver at all for the R200?

There are a few things that the kernel can do for you:

- Locking.

  However, there are better (and faster) locks available in user space, namely the "futex" interface. They take some getting used to, but you can have some _truly_ low-cost locking using them.

  Example library can be found at:
  http://www.kernel.org/pub/linux/kernel/people/rusty/futex-2.0.tar.gz

- Interrupts

  You don't use these right now, and as far as I can tell the main reason for using them would be to just synchronize page flipping with the framerate. No?

- IOIO and IOMEM access

  iopl() gives access to IOIO; mmap() and the AGP driver give access to IOMEM/AGP.

  IOIO is actually slightly slower in CPL3 than in CPL0, but it's slower in CPU cycles, not in IO cycles. And since IO cycles definitely dominate in IOIO (by orders of magnitude), this isn't likely to be an issue.

  And IOMEM is the same speed, since the only overhead for user space is the TLB, and AGP mappings use the TLB even in kernel space (vmalloc).

- Global datastructures

  I think you do the aging right now globally or something.

What else? Right now you cache some stuff globally (the ring tail ptr etc), but that isn't necessary: you can re-create the information on demand after a lock acquisition (since it is only needed when contention happens).

So from what I can tell, a trusted entity doesn't strictly _need_ any kernel support.

Yes, kernel support (or indirect rendering) is needed for untrusted applications, but it might actually be interesting to see what a direct-rendering all-user-land implementation looks like. It has some debugging advantages, and it may actually make sense to start from a totally trusted app that goes as fast as humanly possible, and then when that has been optimized to death, look at just where the interfaces make the most sense..

(A user land implementation would imply fairly static AGP memory allocation, I guess.)

Linus |
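A minimal sketch of the futex idea: an atomic operation in user space on the fast path, a system call only on contention. This is a simplified illustration (it skips the waiter-count optimization used by the futex library Linus links to, so unlock always issues a wake), written with GCC atomic builtins as modern stand-ins for inline assembly:

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Lock word lives in memory shared by all contenders: 0=free, 1=held. */
    static void futex_lock(int *l)
    {
            while (__sync_lock_test_and_set(l, 1))   /* fast path: one locked op */
                    /* Contended: sleep in the kernel while the word is still 1. */
                    syscall(SYS_futex, l, FUTEX_WAIT, 1, NULL, NULL, 0);
    }

    static void futex_unlock(int *l)
    {
            __sync_lock_release(l);                  /* *l = 0, with barrier */
            syscall(SYS_futex, l, FUTEX_WAKE, 1, NULL, NULL, 0);
    }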
From: Keith W. <ke...@tu...> - 2002-06-18 08:57:41
|
Linus Torvalds wrote:
> Keith,
> I've got a silly question for you..
>
> Why do you need a kernel driver at all for the R200?

I go into your mail below, but the only good answers I have are:

1) To allow us to mmap the framebuffer, agp and mmio regions (or to handle mmio for us without us mapping it).
2) Backwards compatibility. The ddx module is shared with the radeon & wants to talk to a kernel module. This can be worked around.

> There are a few things that the kernel can do for you:
>
> - Locking.
>
>   However, there are better (and faster) locks available in user space, namely the "futex" interface. They take some getting used to, but you can have some _truly_ low-cost locking using them.
>
>   Example library can be found at:
>   http://www.kernel.org/pub/linux/kernel/people/rusty/futex-2.0.tar.gz

I'm not sure how these are so much better in concept than the concept behind our existing lock. Both seem to have a userspace fast path (with a locked cycle) and a syscall/ioctl slow path on contention. (A sketch of our existing fast path is after this message.) The implementation of our lock has various workstation leftovers, like infrastructure for real virtualization of the hardware (the kernel does context switching on lock contention), which aren't really used.

> - Interrupts
>
>   You don't use these right now, and as far as I can tell the main reason for using them would be to just synchronize page flipping with the framerate. No?

Correct.

> - IOIO and IOMEM access
>
>   iopl() gives access to IOIO; mmap() and the AGP driver give access to IOMEM/AGP.
>
>   IOIO is actually slightly slower in CPL3 than in CPL0, but it's slower in CPU cycles, not in IO cycles. And since IO cycles definitely dominate in IOIO (by orders of magnitude), this isn't likely to be an issue.
>
>   And IOMEM is the same speed, since the only overhead for user space is the TLB, and AGP mappings use the TLB even in kernel space (vmalloc).

I'm not sure how this works. Does the agp module have a facility to allow the client to mmap the card mmio region & the framebuffer? I wasn't aware of this.

> - Global datastructures
>
>   I think you do the aging right now globally or something.
>
> What else? Right now you cache some stuff globally (the ring tail ptr etc), but that isn't necessary: you can re-create the information on demand after a lock acquisition (since it is only needed when contention happens).

Contention gives us a hint to check if the cliprects have changed. There's a fairly ugly mechanism for retrieving the new cliprects (drop hw lock, get a spin-type lock, send a request, get a reply, drop the spin-lock, re-acquire the hw lock). However, the check to see if this is necessary is cheap, and the cliprects aren't required that often anyway.

> So from what I can tell, a trusted entity doesn't strictly _need_ any kernel support.
>
> Yes, kernel support (or indirect rendering) is needed for untrusted applications, but it might actually be interesting to see what a direct-rendering all-user-land implementation looks like. It has some debugging advantages, and it may actually make sense to start from a totally trusted app that goes as fast as humanly possible, and then when that has been optimized to death, look at just where the interfaces make the most sense..

This is closer & closer to the Utah direct rendering model (not that I'm complaining...) In that model, synchronization was achieved by having the X server be the only entity to touch the mmio region, but the client had direct access to a (large) dma buffer which it could ask the X server (via extended X11 protocol) to dispatch for it. The X server would take care of cliprect issues.

This actually worked pretty well, but was limited to a single direct client (second & subsequent clients would go indirect, maybe sw-indirect, I forget). A little bit of work could extend that fairly easily to multiple clients. It also required that the direct client be run as root in order to mmap the framebuffer & dma region.

I think it's probably time to start considering a rewrite/redesign of the 3d infrastructure based around a minimalist approach. There's just so much leftover code hanging around I have to ask what can be salvaged.

Keith |
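For reference, a rough sketch of the existing-lock pattern Keith describes: a compare-and-swap on a lock word in shared memory, an ioctl only on contention, with the contended path doubling as the hint to revalidate cliprects. The ioctl number, stamp variables, and field names are illustrative stand-ins, not the exact libdrm interface:

    #include <sys/ioctl.h>

    extern int drm_fd;                           /* open DRM device */
    extern unsigned int *sarea_stamp, my_stamp;  /* hypothetical cliprect stamps */
    extern void update_cliprects(void);

    #define LOCK_HELD      0x80000000U
    #define DRM_LOCK_SLOW  0x402c6429U           /* stand-in ioctl number */

    static void hw_lock(volatile unsigned int *lock, unsigned int ctx)
    {
            /* Fast path: one locked bus cycle when the lock is free. */
            if (!__sync_bool_compare_and_swap(lock, 0, ctx | LOCK_HELD)) {
                    /* Contended: let the kernel queue us & grant the lock. */
                    ioctl(drm_fd, DRM_LOCK_SLOW, ctx);
                    /* Contention means someone else ran: a cheap stamp check
                     * tells us whether to refetch the cliprects. */
                    if (*sarea_stamp != my_stamp) {
                            update_cliprects();
                            my_stamp = *sarea_stamp;
                    }
            }
    }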
From: Michel <mi...@da...> - 2002-06-18 09:09:35
|
On Tue, 2002-06-18 at 10:57, Keith Whitwell wrote:
> > - IOIO and IOMEM access
> >
> >   iopl() gives access to IOIO; mmap() and the AGP driver give access to IOMEM/AGP.
> >
> >   IOIO is actually slightly slower in CPL3 than in CPL0, but it's slower in CPU cycles, not in IO cycles. And since IO cycles definitely dominate in IOIO (by orders of magnitude), this isn't likely to be an issue.
> >
> >   And IOMEM is the same speed, since the only overhead for user space is the TLB, and AGP mappings use the TLB even in kernel space (vmalloc).
>
> I'm not sure how this works. Does the agp module have a facility to allow the client to mmap the card mmio region & the framebuffer? I wasn't aware of this.

I don't know about that, but framebuffer devices certainly do. Not sure if they will after the ongoing API changes in 2.5 though.

And failing that, one could fall back to /dev/mem, like DGA clients. Not that I advocate that. :)

Just two cents of mine...

--
Earthling Michel Dänzer (MrCooper) / Debian GNU/Linux (powerpc) developer
XFree86 and DRI project member / CS student, Free Software enthusiast |
From: Benjamin H. <be...@ke...> - 2002-06-18 09:18:42
|
> - Interrupts
>
>   You don't use these right now, and as far as I can tell the main reason for using them would be to just synchronize page flipping with the framerate. No?

Which would be nice to have for proper frame-sync on interlaced displays (especially with Michel Danzer's work on using DRM for Xv blits).

> - IOIO and IOMEM access
>
>   iopl() gives access to IOIO

Which sucks on non-x86, but here XFree has its own stuff anyway.

>   mmap() and AGP driver gives access to IOMEM/AGP

That one is problematic. I don't support the mmap interface properly on Apple chipsets, for example, because they don't support the AGP aperture being accessed by the CPU. I play mapping tricks for the in-kernel mapping of the aperture (using a home-made agp_ioremap in the DRM) and I use special vm_ops for drmMap of the AGP so that the real memory pages get mapped in the client processes.

I could do the same with the AGP driver, though the main problem with it currently is that clients using it via the ioctl interface tend to first mmap the aperture, then bind/unbind memory to/from it. I don't say that can't be fixed though ;)

I would much prefer the agpgart interface to be redesigned around different semantics: mostly, vmalloc() some space to use as AGP memory, then bind that to the GART, but don't rely on direct AGP aperture access.

There are also some slight speed improvements to be won using this scheme, as I could map the AGP memory as cacheable (which would give a significant boost on PPC) provided buffers & ring get properly flushed before being "passed" to the chip.

Ben. |
From: Linus T. <tor...@tr...> - 2002-06-18 16:10:06
|
On Mon, 17 Jun 2002, Benjamin Herrenschmidt wrote:
>>   mmap() and AGP driver gives access to IOMEM/AGP
>
> That one is problematic. I don't support the mmap interface properly on Apple chipsets, for example, because they don't support the AGP aperture being accessed by the CPU.

I assume you mean that the CPU doesn't honour the AGP mappings, but the CPU _can_ access the physical pages themselves. How do you do it right now, since we seem to be doing "ioremap_nocache()" all over the place with the AGP aperture?

But fundamentally that should not be a problem: we can map the (unmapped) AGP pages one page at a time (rather than as one contiguous block of remapped pages) into user mode. I thought AGP already supported a mmap() interface, and if it really doesn't, it should be trivial to do...

[ Time passes, Linus looks at the sources ]

Ok, there does seem to be mmap() support in the AGP module, but it seems to use that stupid "remap_page_range()" and the AGP base (similar to ioremap() inside the kernel), so it does seem to mmap the _mapped_ AGP area.

It would be possible to just install a "nopage" handler, and map one page at a time on demand from the pool of (non-GART-mapped) pages that we keep in the gatt_table[] or whatever. (A rough sketch of that follows below.) Maybe there is some reason for doing it that way that I don't understand. More likely, it's just done that way because it was the simple and stupid approach.

However, you seem to prefer a different approach, which would certainly work:

> I would much prefer the agpgart interface to be redesigned around different semantics: mostly, vmalloc() some space to use as AGP memory, then bind that to the GART, but don't rely on direct AGP aperture access.
>
> There are also some slight speed improvements to be won using this scheme, as I could map the AGP memory as cacheable (which would give a significant boost on PPC) provided buffers & ring get properly flushed before being "passed" to the chip.

Hmm.. It would be fairly simple to do all page allocation in user space, and have an interface that says "put the physical page corresponding to my virtual address xxxx into the AGP aperture at offset yyyy".

This would effectively disallow the above "map by unmapped page" approach, because it's too damn expensive to find and flush any existing mappings when somebody maps in a new page. And if not all systems support the GART-assisted CPU mapping that we do now, that means that nobody can mmap the AGP area into memory.

The expensive part would be the "mark this page uncacheable" when moving it to the AGP buffer, which implies a cross-CPU TLB flush for each such page. So moving a page into the AGP aperture is fundamentally a fairly expensive operation: wbinvd itself takes a _loong_ time, but if you have to do it on all CPUs along with the TLB flush, it gets _really_ expensive. So moving pages that way is definitely not cheap either.

Hmm.

Linus |
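A rough 2.4-era sketch of the nopage idea mentioned above: hand out the backing pages one at a time as the client faults them in, instead of remap_page_range() over the whole GART-mapped aperture. The gatt_pages[] array is a hypothetical stand-in for however the driver tracks its backing pages, and the details of the 2.4 vm interface are approximated:

    #include <linux/mm.h>

    extern struct page *gatt_pages[];   /* hypothetical backing-page table */

    /* Demand-map one backing page per fault (2.4-style vm_ops). */
    static struct page *agp_vma_nopage(struct vm_area_struct *vma,
                                       unsigned long address, int unused)
    {
            unsigned long idx = (address - vma->vm_start) >> PAGE_SHIFT;
            struct page *page = gatt_pages[idx]; /* backing page, sans GART */

            if (!page)
                    return NOPAGE_SIGBUS;        /* hole: no page bound here */
            get_page(page);                      /* one ref per user mapping */
            return page;
    }

    static struct vm_operations_struct agp_vm_ops = {
            nopage: agp_vma_nopage,              /* old GCC-style initializer */
    };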