From: Gareth H. <gar...@ac...> - 2002-05-16 22:42:08
|
I would like to propose a small change to the pthread_descr structure in the latest LinuxThreads code, to better support OpenGL on GNU/Linux systems (particularly on x86, but not excluding other platforms). The purpose of this patch is to provide efficient thread-local storage for both libGL itself and loadable OpenGL driver modules, so that they can be made thread-safe without any impact on performance. Indeed, using this mechanism, an OpenGL driver can ignore the difference between running with a single thread and running with multiple threads, as "global" data will be accessed in the same way independent of the number of threads running.

To understand the need for such a change, one should consider what goes on inside an OpenGL implementation when an application makes an OpenGL API call. One of the primary tasks of the driver-independent libGL is to dispatch function calls to the driver backend(s), usually through a large function pointer table containing entries for the several hundred API entrypoints. Central to this process is the notion of a rendering context, or an abstraction of the OpenGL state machine. A context is required to perform OpenGL commands. The GLX specification states:

    Each thread can have at most one current rendering context. In
    addition, a rendering context can be current for only one thread
    at a time.

The dispatch table for a context depends on the current state of OpenGL for that context, as things like display list compilation, display list playback and plain old immediate mode rendering change the behaviour of many API entrypoints. We see from the quote above that each thread has, at most, a single context, and this context has a single current dispatch table. The top-level API entrypoints can be implemented like the following:

    struct gl_dispatch {
        ...
        void (*Foo)(GLint bar);
        ...
    };

    void glFoo(GLint bar)
    {
        struct gl_dispatch *current = __get_current_dispatch();
        current->Foo(bar);
    }

Similarly, a driver's implementation of the above entrypoint might look like the following:

    void __my_Foo(GLint bar)
    {
        struct gl_context *gc = __get_current_context();

        /* remember the current setting of bar */
        gc->state.current.bar = bar;

        /* do stuff with bar, like program hardware registers */
        ...
    }

We want __get_current_context() and __get_current_dispatch() (at a minimum) to be as efficient as possible, while still providing thread safety. Suppose we add a libGL-specific area to pthread_descr. This would allow us to implement these (and other similar) functions like so:

    void *__get_current_context(void)
    {
        pthread_descr self = thread_self();
        return THREAD_GETMEM(self, p_libGL_specific[_LIBGL_TSD_KEY_CONTEXT]);
    }

    void *__get_current_dispatch(void)
    {
        pthread_descr self = thread_self();
        return THREAD_GETMEM(self, p_libGL_specific[_LIBGL_TSD_KEY_DISPATCH]);
    }

This would allow us to hand-code the top-level dispatch functions on x86 as:

    glFoo:
        movl %gs:__gl_context_offset, %eax
        jmp *__glapi_Foo(%eax)

where __gl_context_offset is the byte offset of the thread-local context pointer and __glapi_Foo is the byte offset of the Foo entry in the dispatch table. Clearly this is an efficient implementation of the dispatch mechanism required by OpenGL, and is completely thread-safe to boot.

With modern OpenGL applications and benchmarks dealing with datasets containing over 1 million vertices, with one or more function calls per vertex, you can see that an efficient dispatching mechanism is crucial for a high-performance OpenGL implementation.
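
For readers who want to experiment with the per-thread dispatch idea outside of glibc, the following is a minimal sketch only. It swaps in C11 _Thread_local storage for the proposed p_libGL_specific[] slots, which is exactly the kind of compiler-managed TLS the message argues is slower than a direct %gs-relative load, so it illustrates the structure and not the performance. All names (make_current, glFoo_sketch, the one-entry table) are hypothetical and do not come from libGL.

```c
#include <stddef.h>

/* Hypothetical, cut-down dispatch table: one entry instead of hundreds. */
struct gl_dispatch {
    void (*Foo)(int bar);
};

/* Hypothetical per-context state written by the driver backend. */
struct gl_context {
    int current_bar;
};

/* One pointer per thread. _Thread_local (C11) stands in for the
 * p_libGL_specific[] slots proposed in the message; this is an
 * assumption made to keep the sketch portable. */
static _Thread_local struct gl_dispatch *current_dispatch;
static _Thread_local struct gl_context  *current_context;

/* Driver-side implementation: records bar in the calling thread's context. */
static void my_Foo(int bar)
{
    current_context->current_bar = bar;
}

static struct gl_dispatch my_dispatch = { my_Foo };

/* Bind a context and its dispatch table to the calling thread. */
void make_current(struct gl_context *gc)
{
    current_context  = gc;
    current_dispatch = gc ? &my_dispatch : NULL;
}

/* Public entrypoint: load the calling thread's table and jump through it. */
void glFoo_sketch(int bar)
{
    current_dispatch->Foo(bar);
}
```

After make_current(&gc) on a given thread, glFoo_sketch(7) stores 7 in gc.current_bar for that thread only; another thread with its own context would go through its own pointers without any locking.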

For example, the SPEC Viewperf benchmark's Light test (as described at http://www.spec.org/gpc/opc.static/light05.html) includes a subtest that renders over half a million wireframe primitives like so:

    GLfloat color[][4];
    GLfloat position[][4];

    glBegin(GL_LINE_LOOP);
    glColor3fv(color[i]);
    glVertex3fv(position[i]);
    glColor3fv(color[i+1]);
    glVertex3fv(position[i+1]);
    glColor3fv(color[i+2]);
    glVertex3fv(position[i+2]);
    glColor3fv(color[i+3]);
    glVertex3fv(position[i+3]);
    glEnd();

With 10 function calls per primitive, this equates to over 5 million function calls per frame. This is certainly a worst-case scenario, and there are certainly more efficient methods of rendering such large amounts of data, but Viewperf (the industry-standard OpenGL benchmark) deliberately stresses this path to measure the cost of API calls, as many workstation OpenGL apps (engineering, CAD and 3D modelling tools) still operate like this.

An important point to understand is that the round trip through the API, into the driver and back out again for this immediate mode path can often be counted in tens of instructions. State of the art OpenGL implementations often do runtime code generation to implement these paths, resulting in a very lightweight driver backend for this part of the API. Clearly, the dispatching mechanism becomes a significant percentage of the total number of cycles here, and thus we want it to be as efficient as possible.

It is worth noting that the Microsoft Windows implementation of OpenGL for x86 has dedicated space in its per-thread data structures for things like the current context and dispatch table, so anything more than the above two instructions for the top-level entrypoint will put GNU/Linux at a disadvantage compared with that platform.

For ease of implementation and delivery of a libGL that makes use of these features, I propose we add the libGL-specific thread local storage area in the space reserved for the p_header field at the start of the structure, like so:

    enum __libGL_tsd_key_t {
        _LIBGL_TSD_KEY_CONTEXT = 0,
        _LIBGL_TSD_KEY_DISPATCH,
        /* leave room for vendor-specific data */
        _LIBGL_TSD_KEY_N = 8
    };

     struct _pthread_descr_struct {
       union {
         struct {
           void *tcb;
           union dtv *dtvp;
           pthread_descr self;
         } data;
    -    void *__padding[16];
    +    void *__padding[8];
       } p_header;
    +  void *p_libGL_specific[_LIBGL_TSD_KEY_N];
       pthread_descr p_nextlive, p_prevlive;
       pthread_descr p_nextwaiting;
       pthread_descr p_nextlock;
       ...
     };

This allows us to provide this functionality in libGL on all glibc-2.2 systems. glibc-2.1 can also be supported on x86 (given a functional kernel), as these versions did not use the segment registers for pthread_descr access.

Given the importance of "free" thread local storage to an OpenGL implementation, I believe this change is warranted, even though it may not be the one chosen if the pthread_descr structure were being designed from scratch.

The recent advancements in glibc's support for automatic thread-local storage using the __thread keyword (as described in Ulrich Drepper's To-Do list for glibc-2.3 under 'Implement TLS in dynamic linker', http://people.redhat.com/drepper/todo-2.3.html) are encouraging, and some may suggest that using this support would be more appropriate. However, the additional pointer dereferences to access these new thread-local storage areas will always be slower than the direct access scheme proposed above. Also, it will be some time before this system is widely available, whereas the changes proposed above are binary compatible with all glibc-2.2 releases.

-- Gareth

PS - I am not subscribed to libc-alpha, so please CC me on any replies.
|
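
The binary-compatibility claim in the proposal rests on the new array occupying exactly the space given up by the padding. A minimal sketch of that check follows, using cut-down stand-in structures (descr_old, descr_new and header_data are hypothetical and model only the fields needed for the argument, not the real LinuxThreads definitions):

```c
#include <stddef.h>

enum { LIBGL_TSD_KEY_N = 8 };

/* Stand-in for the real header contents: three pointer-sized fields. */
struct header_data {
    void *tcb;
    void *dtvp;
    void *self;
};

/* Layout before the proposed change: 16 pointers of padding. */
struct descr_old {
    union {
        struct header_data data;
        void *padding[16];
    } p_header;
    void *p_nextlive;
};

/* Layout after the proposed change: 8 pointers of padding plus 8 slots. */
struct descr_new {
    union {
        struct header_data data;
        void *padding[8];
    } p_header;
    void *p_libGL_specific[LIBGL_TSD_KEY_N];
    void *p_nextlive;
};

/* Every field that follows the header keeps its old offset. */
_Static_assert(offsetof(struct descr_old, p_nextlive) ==
               offsetof(struct descr_new, p_nextlive),
               "fields after the header must not move");
_Static_assert(sizeof(struct descr_old) == sizeof(struct descr_new),
               "overall size must not change");

/* Byte offset of a libGL slot from the start of the descriptor. */
size_t libgl_slot_offset(int key)
{
    return offsetof(struct descr_new, p_libGL_specific)
         + (size_t)key * sizeof(void *);
}
```

In this model the context slot sits 8 pointers into the descriptor and the dispatch slot 9 pointers in, which is the kind of fixed offset the hand-coded %gs-relative dispatch stub would need.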
From: Jakub J. <ja...@re...> - 2002-05-16 23:19:00
|
Hi!

What percentage of applications use different dispatch tables among their threads? How often do dispatch table changes occur? If both of these are fairly low, computing a dispatch table in an awx section at dispatch table switch time might be fastest (i.e. prepare something like:

        .section dispatch, "awx"
        .align 8
        .globl glFooBar
glFooBar:
        jmp something
        nop; nop; nop

and <something> would be changed whenever a dispatch table switch happens for all dispatch table members).

BTW: Last time I looked at libGL (in March), these were things which I came across:
1) libGL should IMHO use a version script (at least an anonymous one if you want to avoid assigning a specific GL_x.y symbol version to it), that way you get rid of thousands of expensive run-time relocations
2) last time I looked, libGL.so was linked unconditionally against libpthread. This is punishing all non-threaded apps, weak undefined symbols work very well
3) I don't think building without -fpic is a good idea, 1) together with other tricks might speed things up while avoiding DT_TEXTREL overhead

There were some other things, but I don't remember them very well. If I find time I'll build libGL again and check the disassembly.

Jakub
|
From: Gareth H. <gar...@ac...> - 2002-05-16 23:30:23
|
Jakub Jelinek wrote:
> Hi!
>
> What percentage of applications use different dispatch
> tables among its threads? How often do dispatch table changes
> occur? If both of these are fairly low, computing a dispatch table
> in an awx section at dispatch table switch time might be fastest
> (ie. prepare something like:
>         .section dispatch, "awx"
>         .align 8
>         .globl glFooBar
> glFooBar:
>         jmp something
>         nop; nop; nop
>
> and <something> would be changed whenever a dispatch table switch happens
> for all dispatch table members).

That's not really feasible, as the tables can change very frequently (as often as every glBegin/glEnd, or maybe even every function call between glBegin and glEnd). Also, dispatch tables will *always* be different between threads; that's why they need to be accessed in a thread-safe manner. Finally, rewriting the instructions like this will have very bad trace cache behaviour on the Pentium 4, where touching instructions that have already been decoded causes the entire trace cache to be flushed.

> BTW: Last time I looked at libGL (in March), these were things which I
> came over:
> 1) libGL should IMHO use a version script (at least an anonymous one
>    if you want to avoid assigning a specific GL_x.y symbol version to it),
>    that way you get rid of thousands of expensive run-time relocations

Can you explain this in more detail? I'm not sure I understand what you're saying.

> 2) last time I looked, libGL.so was linked unconditionally against
>    libpthread. This is punishing all non-threaded apps, weak undefined
>    symbols work very well

I agree.

> 3) I don't think building without -fpic is a good idea, 1) together with
>    other tricks might speed things up while avoiding DT_TEXTREL
>    overhead

Again, could you explain this in more detail? Thanks.

-- Gareth
|
From: Keith W. <ke...@tu...> - 2002-05-16 23:32:59
|
Jakub Jelinek wrote:
>
> Hi!
>
> What percentage of applications use different dispatch
> tables among its threads? How often do dispatch table changes
> occur? If both of these are fairly low, computing a dispatch table
> in an awx section at dispatch table switch time might be fastest
> (ie. prepare something like:
>         .section dispatch, "awx"
>         .align 8
>         .globl glFooBar
> glFooBar:
>         jmp something
>         nop; nop; nop
>
> and <something> would be changed whenever a dispatch table switch happens
> for all dispatch table members).
>
> BTW: Last time I looked at libGL (in March), these were things which I
> came over:
> 1) libGL should IMHO use a version script (at least an anonymous one
>    if you want to avoid assigning a specific GL_x.y symbol version to it),
>    that way you get rid of thousands of expensive run-time relocations

Where can I get info on this?

> 2) last time I looked, libGL.so was linked unconditionally against
>    libpthread. This is punishing all non-threaded apps, weak undefined
>    symbols work very well

This is because we currently use the standard way of getting thread-local data and detecting multi-thread situations. I'm not sure how Gareth is able to detect threaded vs. non-threaded situations without making any calls into the pthreads library, but once you know which one you're in, with his trick, you don't need to make any more.

Currently we do something like this in MakeCurrent:

void
_glapi_check_multithread(void)
{
#if defined(THREADS)
   if (!ThreadSafe) {
      static unsigned long knownID;
      static GLboolean firstCall = GL_TRUE;
      if (firstCall) {
         knownID = _glthread_GetID();
         firstCall = GL_FALSE;
      }
      else if (knownID != _glthread_GetID()) {
         ThreadSafe = GL_TRUE;
      }
   }
   if (ThreadSafe) {
      /* make sure that this thread's dispatch pointer isn't null */
      if (!_glapi_get_dispatch()) {
         _glapi_set_dispatch(NULL);
      }
   }
#endif
}

where _glthread_GetID() is really pthread_self().

How do you detect threading without making these calls to libpthread.so?

> 3) I don't think building without -fpic is a good idea, 1) together with
>    other tricks might speed things up while avoiding DT_TEXTREL
>    overhead

The thing that really bites with -fpic is the bs you have to go through to get access to static symbols (forgive my loose terminology) like static variables or other functions you want to call. Gareth's trick means that two very important variables avoid this, but it's still going to be necessary to call other functions often enough...

> There were some other things, but I don't remember it very well. If I find
> time I'll build libGL again and check the disassembly.

As someone who 1) is concerned about libGL performance and 2) doesn't know much about relocation/fpic/pthreads/etc, I'd love to hear anything you've got on this.

Keith
|
From: Gareth H. <gar...@ac...> - 2002-05-16 23:44:44
|
Keith Whitwell wrote:
>>
>>2) last time I looked, libGL.so was linked unconditionally against
>>   libpthread. This is punishing all non-threaded apps, weak undefined
>>   symbols work very well
>
>
> This is because we currently use the standard way of getting thread-local-data
> and detecting multi-thread situations. I'm not sure how Gareth is able to
> detect threaded vs. non-threaded situations without making any calls into the
> pthreads library, but once you know which one you're in, with his trick, you
> don't need to make any more.
>
> Currently we do something like this in MakeCurrent:
>
> void
> _glapi_check_multithread(void)
> {
> #if defined(THREADS)
>    if (!ThreadSafe) {
>       static unsigned long knownID;
>       static GLboolean firstCall = GL_TRUE;
>       if (firstCall) {
>          knownID = _glthread_GetID();
>          firstCall = GL_FALSE;
>       }
>       else if (knownID != _glthread_GetID()) {
>          ThreadSafe = GL_TRUE;
>       }
>    }
>    if (ThreadSafe) {
>       /* make sure that this thread's dispatch pointer isn't null */
>       if (!_glapi_get_dispatch()) {
>          _glapi_set_dispatch(NULL);
>       }
>    }
> #endif
> }
>
> where _glthread_GetID() is really pthread_self().
>
> How do you detect threading without making these calls to libpthread.so?

The important point is that you don't really need to detect threading anymore. The Linux OpenGL ABI states that multithreaded apps must link with pthreads. Thus, at startup, you can detect the presence of pthreads or otherwise. Basically, if pthreads is present, you just use the pthread_descr that it set up, otherwise you create a dummy one and plug it into the segment registers (or whatever) and be done with it. From that point on, you don't care how many threads there are. Accessing "global" data is always done the same way, independent of the number of threads running.

In any case, it would be great to remove the need for apps that link with libGL to also link with pthreads, which forces the use of pthreads even for single-threaded apps.

> The thing that really bites with -fpic is the bs you have to go through to get
> access to static symbols (forgive my loose terminology) like static variables
> or other functions you want to call. Gareth's trick means that two very
> important variables avoid this, but it's still going to be necessary to call
> other functions often enough...

I'd like to hear a strong argument as to why you *would* want to link with -fpic. Like Keith, I'm also not familiar with some of the more in-depth aspects w.r.t. relocation/fpic etc, so feel free to enlighten us.

-- Gareth
|
From: Keith W. <ke...@tu...> - 2002-05-16 23:49:32
|
Gareth Hughes wrote:
>
> Keith Whitwell wrote:
> >>
> >>2) last time I looked, libGL.so was linked unconditionally against
> >>   libpthread. This is punishing all non-threaded apps, weak undefined
> >>   symbols work very well
> >
> >
> > This is because we currently use the standard way of getting thread-local-data
> > and detecting multi-thread situations. I'm not sure how Gareth is able to
> > detect threaded vs. non-threaded situations without making any calls into the
> > pthreads library, but once you know which one you're in, with his trick, you
> > don't need to make any more.
> >
> > Currently we do something like this in MakeCurrent:
> >
> > void
> > _glapi_check_multithread(void)
> > {
> > #if defined(THREADS)
> >    if (!ThreadSafe) {
> >       static unsigned long knownID;
> >       static GLboolean firstCall = GL_TRUE;
> >       if (firstCall) {
> >          knownID = _glthread_GetID();
> >          firstCall = GL_FALSE;
> >       }
> >       else if (knownID != _glthread_GetID()) {
> >          ThreadSafe = GL_TRUE;
> >       }
> >    }
> >    if (ThreadSafe) {
> >       /* make sure that this thread's dispatch pointer isn't null */
> >       if (!_glapi_get_dispatch()) {
> >          _glapi_set_dispatch(NULL);
> >       }
> >    }
> > #endif
> > }
> >
> > where _glthread_GetID() is really pthread_self().
> >
> > How do you detect threading without making these calls to libpthread.so?
>
> The important point is that you don't really need to detect threading
> anymore. The Linux OpenGL ABI states that multithreaded apps must link
> with pthreads. Thus, at startup, you can detect the presence of
> pthreads or otherwise. Basically, if pthreads is present, you just use
> the pthread_descr that it set up, otherwise you create a dummy one and
> plug it into the segment registers (or whatever) and be done with it.
> From that point on, you don't care how many threads there are.
> Accessing "global" data is always done the same way, independent of the
> number of threads running.

Hmm. And does libpthread.so *always* set this up -- is it possible to link to libpthread.so but not actually use it, spoofing your detection?

> In any case, it would be great to remove the need of apps that link with
> libGL to also link with pthreads, and to force the use of pthreads even
> for single-threaded apps.

I agree.

> > The thing that really bites with -fpic is the bs you have to go through to get
> > access to static symbols (forgive my loose terminology) like static variables
> > or other functions you want to call. Gareth's trick means that two very
> > important variables avoid this, but it's still going to be necessary to call
> > other functions often enough...
>
> I'd like to hear a strong argument as to why you *would* want to link
> with -fpic. Like Keith, I'm also not familiar with some of the more
> in-depth aspects w.r.t. relocation/fpic etc, so feel free to enlighten us.

Me also.

Keith
|
From: Jakub J. <ja...@re...> - 2002-05-17 06:38:59
|
On Fri, May 17, 2002 at 12:30:56AM +0100, Keith Whitwell wrote:
> > BTW: Last time I looked at libGL (in March), these were things which I
> > came over:
> > 1) libGL should IMHO use a version script (at least an anonymous one
> >    if you want to avoid assigning a specific GL_x.y symbol version to it),
> >    that way you get rid of thousands of expensive run-time relocations
>
> Where can I get info on this?

info ld, study glibc sources.

Just checked the latest libGL.so, it has:
 0x00000016 (TEXTREL)                    0x0
 0x6ffffffa (RELCOUNT)                   3615
 0x00000000 (NULL)                       0x0

Relocation section '.rel.dyn' at offset 0x14074 contains 12341 entries:

This effectively means libGL.so is not a shared library at all and takes an awfully long time to load during program startup.

What you can and should do:

create a version script like either:
{ global: gl*; DRI*; XF86DRI*; local: *; };
or
GL_1.0 { global: gl*; DRI*; XF86DRI*; local: *; };
then pass this file to the linker at libGL.so link time, like
gcc -shared ... -Wl,--version-script,libGL.map ...

This way you get rid of most R_386_PC32 relocations, etc.

If libGL.so is compiled with -fpic, you should also add __attribute__((visibility("hidden"))) to the prototypes in the headers that declare the internal functions, so that gcc can avoid setting up pic pointers when calling the internal functions. I think most of the functions are prototyped using macros, so it wouldn't be much work.

When you are calling the exported functions internally and there is no reason why applications should be able to interpose those symbols, create internal aliases for them (like __gl* or __internal_gl* or whatever) and use them instead (again, this can be seen a lot in glibc sources and helps a lot).

> How do you detect threading without making these calls to libpthreads.so?

Weak symbols.

extern pthread_t pthread_self (void) __attribute__ ((weak));
(resp. #pragma weak pthread_self).
Then
if (&pthread_self)
  { application was linked against -lpthread }
else
  { not threaded }

Note that all calls to the pthread library have to be weak.

> > 3) I don't think building without -fpic is a good idea, 1) together with
> >    other tricks might speed things up while avoiding DT_TEXTREL
> >    overhead
>
> The thing that really bites with -fpic is the bs you have to go through to get
> access to static symbols (forgive my loose terminology) like static variables
> or other functions you want to call. Gareth's trick means that two very
> important variables avoid this, but it's still going to be necessary to call
> other functions often enough...

Well, if you use the above things, then calls to other functions local to libGL.so (something like static functions, though they can be used across different .o files in the whole library) will be simple call instructions, with no need to set up a pic pointer or anything. For static data variables I'd have to check the libGL assembly; surely something can be done for the functions where this matters. Note that libGL is a typical library using tons of macros, where by changing the macros you can easily tweak most of the functions which matter for performance (so it is very good to study the resulting assembly closely).

> > There were some other things, but I don't remember it very well. If I find
> > time I'll build libGL again and check the disassembly.
>
> As someone who 1) is concerned about libGL performance and 2) doesn't know
> much about relocation/fpic/pthreads/etc, I'd love to hear anything you've got
> on this.

If you don't use -fpic, libGL.so is basically not shared, takes far more time to load and eats more unshareable memory. There are ways to combine the advantages of -fpic without taking the performance hit.

Jakub
|
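
The hidden-visibility and internal-alias suggestions in the message above can be sketched roughly as follows. This is an illustration under stated assumptions: GCC or Clang on an ELF target, and invented names throughout (GL_HIDDEN, gl_internal_clamp, glSketchClamp255 and gl_internal_SketchClamp255 are hypothetical, not real libGL symbols). The alias declaration follows the pattern glibc uses for its own internal aliases.

```c
/* Hypothetical names throughout; GCC/Clang on an ELF target is assumed. */
#define GL_HIDDEN __attribute__((visibility("hidden")))

/* Internal helper: hidden, so calls to it from -fpic code bind locally
 * and do not go through the PLT. */
GL_HIDDEN int gl_internal_clamp(int v, int lo, int hi);

int gl_internal_clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Exported entrypoint, visible to applications. */
int glSketchClamp255(int v)
{
    return gl_internal_clamp(v, 0, 255);
}

/* Hidden alias of the exported function, for calls made from inside the
 * library: an application that interposes glSketchClamp255 cannot
 * redirect these internal calls. */
extern __typeof__(glSketchClamp255) gl_internal_SketchClamp255
    __attribute__((alias("glSketchClamp255"), visibility("hidden")));

/* Library-internal caller that uses the alias, not the public name. */
int gl_internal_user(int v)
{
    return gl_internal_SketchClamp255(v) + 1;
}
```

The effect is only visible in the generated code of a shared object: built with -fpic and -shared, the calls to the hidden names should become direct calls, which can be checked by inspecting the disassembly as the message recommends.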
From: Momchil V. <ve...@fa...> - 2002-05-17 07:23:05
|
>>>>> "Jakub" == Jakub Jelinek <ja...@re...> writes:

>> How do you detect threading without making these calls to libpthreads.so?

Jakub> Weak symbols.

Jakub> extern pthread_t pthread_self (void) __attribute__ ((weak));

Jakub> (resp. #pragma weak pthread_self).
Jakub> Then if (&pthread_self) { application was linked against -lpthread } else { not threaded }
Jakub> Note all calls to pthread library has to be weak.

Hmm, if all the references to libpthread are weak, what will pull libpthread in?
|
From: Jakub J. <ja...@re...> - 2002-05-17 07:26:10
|
On Fri, May 17, 2002 at 10:22:23AM +0300, Momchil Velikov wrote:
> >>>>> "Jakub" == Jakub Jelinek <ja...@re...> writes:
>
> >> How do you detect threading without making these calls to libpthreads.so?
>
> Jakub> Weak symbols.
>
> Jakub> extern pthread_t pthread_self (void) __attribute__ ((weak));
>
> Jakub> (resp. #pragma weak pthread_self).
> Jakub> Then if (&pthread_self) { application was linked against -lpthread } else { not threaded }
> Jakub> Note all calls to pthread library has to be weak.
>
> Hmm, if all the references to libpthread are weak, what will pull
> libpthread in ?

Nothing, that's the point. If an application is threaded, it (or one of the libraries it uses) has to be linked with -lpthread. In that case the weak symbols in libGL will resolve to that library and libGL will be thread-safe. If an application is not threaded, you don't want to link against -lpthread.

Jakub
|
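
The weak-reference technique described in this exchange can be tried out with a small sketch like the one below. To keep the outcome predictable it refers to a hypothetical symbol, hypothetical_thread_self, that no library defines, where the messages use pthread_self; on recent glibc releases the pthread functions have moved into libc itself, so a check against pthread_self would be expected to succeed there whether or not the application is threaded. GCC or Clang on an ELF target is assumed.

```c
#include <stddef.h>

/* Hypothetical optional function that would only exist if some threading
 * library were linked in. Being a weak reference, it does not force any
 * library to be pulled in and resolves to a null address when absent. */
extern unsigned long hypothetical_thread_self(void) __attribute__((weak));

/* Non-zero when the weak reference was resolved at link/load time. */
int threading_available(void)
{
    return hypothetical_thread_self != NULL;
}

/* Returns the provider's thread id when available, otherwise a fixed
 * id of 1 for the single-threaded case. */
unsigned long current_thread_id(void)
{
    if (hypothetical_thread_self)
        return hypothetical_thread_self();
    return 1UL;
}
```

Linked on its own, threading_available() reports 0 and current_thread_id() falls back to 1; linking in an object that defines hypothetical_thread_self would switch both to the threaded path without any change to this code.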
From: Keith W. <ke...@tu...> - 2002-05-17 09:50:10
|
Jakub Jelinek wrote:
>
> On Fri, May 17, 2002 at 12:30:56AM +0100, Keith Whitwell wrote:
> > > BTW: Last time I looked at libGL (in March), these were things which I
> > > came over:
> > > 1) libGL should IMHO use a version script (at least an anonymous one
> > >    if you want to avoid assigning a specific GL_x.y symbol version to it),
> > >    that way you get rid of thousands of expensive run-time relocations
> >
> > Where can I get info on this?
>
> info ld, study glibc sources.
>
> Just checked the latest libGL.so, it has:
>  0x00000016 (TEXTREL)                    0x0
>  0x6ffffffa (RELCOUNT)                   3615
>  0x00000000 (NULL)                       0x0
>
> Relocation section '.rel.dyn' at offset 0x14074 contains 12341 entries:
>
> This effectively means libGL.so is not a shared library at all and takes
> awfully lot of time to load during program startup.

Yes, we've been aware of this for a little while. One thing that we've got a bit of in there is assembly for the non-threaded dispatch case (the opensource libGL.so doesn't really handle the threaded case in a performant way, but we've made some effort on the non-threaded case), that looks a bit like this:

ALIGNTEXT16
GLOBL_FN(GL_PREFIX(NewList))
GL_PREFIX(NewList):
        MOV_L(GLNAME(_glapi_Dispatch), EAX)
        JMP(GL_OFFSET(_gloffset_NewList))

This generates the library entrypoint 'glNewList', which just grabs the active dispatch table and jumps to the real function. I had some emails with HJ Lu about this, but didn't really get what he was saying. Are these a problem for building with -fPIC? I'm not really interested in giving this up as I believe any benefits from -fPIC will be quickly outweighed by any loss at the dispatch layer.

> What you can and should do:
>
> create either linker script like:
> { global: gl*; DRI*; XF86DRI*; local: *; };
> or
> GL_1.0 { global: gl*; DRI*; XF86DRI*; local: *; };
> then pass this file to linker at libGL.so link time, like
> gcc -shared ... -Wl,--version-script,libGL.map ...

I'm playing with this now.

Keith
|
From: Jakub J. <ja...@re...> - 2002-05-17 11:59:58
|
On Fri, May 17, 2002 at 10:48:07AM +0100, Keith Whitwell wrote:
> Yes, we've been aware of this for a little while. One thing that we've got a
> bit of in there is assembly for the non-threaded dispatch case (the opensource
> libGL.so doesn't really handle the threaded case in a performant way, but
> we've made some effort on the non-threaded case), that looks a bit like this:
>
> ALIGNTEXT16
> GLOBL_FN(GL_PREFIX(NewList))
> GL_PREFIX(NewList):
>         MOV_L(GLNAME(_glapi_Dispatch), EAX)
>         JMP(GL_OFFSET(_gloffset_NewList))
>
> This generates the library entrypoint 'glNewList', which just grabs the active
> dispatch table and jumps to the real function. I had some emails with HJ Lu
> about this, but didn't really get what he was saying. Are these a problem for
> building with -fPIC? I'm not really interested in giving this up as I believe
> any benefits from -fPIC will be quickly outweighed by any loss at the dispatch
> layer.

Well, if you do this, you should at least put this into an
.section Gltext, "awx"
so that it is not DT_TEXTREL.
But I still wonder how often the target this jumps to will change during the lifetime of a typical GL application if using GLX extensions. Won't it be the __indirect_* variant most of the time, even for threaded apps? Or are they being changed between __indirect_*, noop and software rendering all the time in a typical application?
If changing it is rare, I think my proposal with jmp something; nop; nop; nop and changing it at setdispatch time if something changed should be faster.

Other things: concerning the compsize.c routines, I think they should be at least inlined, if not killed and replaced by switch () statements doing the copy by hand (with fallthroughs). When they are out of line, gcc cannot even guess what values they return, has to do the multiplication at runtime and call memcpy, which is a hop through .plt, and you lose the information that you're e.g. copying only 1, 2 or 4 words.

Jakub
|
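
The "switch with fallthroughs" copy mentioned at the end of the message above might look something like this minimal sketch. The function name copy_words_small and the limit of four 32-bit words are assumptions chosen for illustration; nothing here is taken from compsize.c.

```c
#include <stdint.h>

/* Copies n 32-bit words by hand for n in 1..4; any other n copies
 * nothing. Because the function is inline and n is often a compile-time
 * constant at the call site, the compiler can reduce the switch to a few
 * plain stores, with no call to memcpy. */
static inline void copy_words_small(uint32_t *dst, const uint32_t *src, int n)
{
    switch (n) {
    case 4: dst[3] = src[3]; /* fall through */
    case 3: dst[2] = src[2]; /* fall through */
    case 2: dst[1] = src[1]; /* fall through */
    case 1: dst[0] = src[0]; /* fall through */
    default: break;
    }
}
```

A caller that knows it is copying, say, a three-component vector would call copy_words_small(dst, src, 3) and leave the fourth destination word untouched.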
From: David S. M. <da...@re...> - 2002-05-17 12:05:10
|
   From: Jakub Jelinek <ja...@re...>
   Date: Fri, 17 May 2002 07:58:37 -0400

   Well, if you do this, you should at least put this into an
   .section Gltext, "awx"
   so that it is not DT_TEXTREL.
   But I still wonder, how often will the target this jumps to change
   during lifetime of typical GL application if using GLX extensions.

Every time you record or play a compiled vertex array.

OpenGL provides a set of APIs where you are encapsulating "programs" of OpenGL calls into a "CVA" (compiled vertex array), so what OpenGL does when you start recording is:

        change_opengl_api_funcptrs_to_CVA_record();

So glVertex3f() gets diverted to __glVertex3F_RECORD_FOR_CVA()

Similar things happen when you ask OpenGL to play it back later.

Consider when the device can directly do the gl*() function call. The GLAPI table gets directed to the device driver. If you change some graphics rendering attribute, the card may not be able to do it directly in hardware without some SW help from MESA, and the GLAPI function pointer changes again.

There are thousands upon thousands of ways the OpenGL state machine can change and have the GLAPI function pointers get updated; it happens all the time.

Franks a lot,
David S. Miller
da...@re...
|
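
The table switching described in this message can be modelled with a small sketch. It is a deliberately simplified, single-threaded illustration with hypothetical names (sketch_begin_record, sketch_Vertex3f and so on), not Mesa code: the public entrypoint always jumps through the current table, and entering or leaving "record" mode just repoints that table.

```c
/* Hypothetical one-entry dispatch table. */
struct dispatch {
    void (*Vertex3f)(float x, float y, float z);
};

/* Counters standing in for real work in each mode. */
static int executed_count;
static int recorded_count;

/* Immediate-mode implementation: would draw; here it only counts. */
static void exec_Vertex3f(float x, float y, float z)
{
    (void)x; (void)y; (void)z;
    executed_count++;
}

/* Recording implementation: would append to a list; here it only counts. */
static void record_Vertex3f(float x, float y, float z)
{
    (void)x; (void)y; (void)z;
    recorded_count++;
}

static const struct dispatch exec_table   = { exec_Vertex3f };
static const struct dispatch record_table = { record_Vertex3f };

/* The table currently in effect (global here; per thread in a real libGL). */
static const struct dispatch *current = &exec_table;

void sketch_begin_record(void) { current = &record_table; }
void sketch_end_record(void)   { current = &exec_table; }

/* Public entrypoint: behaviour depends entirely on the current table. */
void sketch_Vertex3f(float x, float y, float z)
{
    current->Vertex3f(x, y, z);
}

int sketch_executed(void) { return executed_count; }
int sketch_recorded(void) { return recorded_count; }
```

Because every state change of this kind is just a pointer assignment, it is cheap enough to happen very often, which is the reason given in the thread for preferring an indirect jump over patching the entrypoint code on each switch.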
From: Keith W. <ke...@tu...> - 2002-05-17 12:19:26
|
Jakub Jelinek wrote:
> On Fri, May 17, 2002 at 10:48:07AM +0100, Keith Whitwell wrote:
>
>>Yes, we've been aware of this for a little while. One thing that we've got a
>>bit of in there is assembly for the non-threaded dispatch case (the opensource
>>libGL.so doesn't really handle the threaded case in a performant way, but
>>we've made some effort on the non-threaded case), that looks a bit like this:
>>
>>ALIGNTEXT16
>>GLOBL_FN(GL_PREFIX(NewList))
>>GL_PREFIX(NewList):
>>        MOV_L(GLNAME(_glapi_Dispatch), EAX)
>>        JMP(GL_OFFSET(_gloffset_NewList))
>>
>>This generates the library entrypoint 'glNewList', which just grabs the active
>>dispatch table and jumps to the real function. I had some emails with HJ Lu
>>about this, but didn't really get what he was saying. Are these a problem for
>>building with -fPIC? I'm not really interested in giving this up as I believe
>>any benefits from -fPIC will be quickly outweighed by any loss at the dispatch
>>layer.
>>
>
> Well, if you do this, you should at least put this into an
> .section Gltext, "awx"
> so that it is not DT_TEXTREL.
> But I still wonder, how often will the target this jumps to change
> during lifetime of typical GL application if using GLX extensions.
> Won't it be most of the time the __indirect_* variant, even for threaded
> apps? Or are they being changed between __indirect_*, noop and
> software rendering all the time in typical application?

The __indirect stuff is rarely used - it packages stuff up and sends it over a pipe (or network connection) for the X server to work on. The real meat of the driver is in a separately dlopened 'driver.so' backend which performs 'direct rendering' -- direct access to the hardware from the application (typically via a kernel dma engine and various mediation/locking schemes supporting multiple clients plus the X server all banging away at the same piece of hardware at once).

The driver.so is kept separate and dlopened to cope with people changing video cards or indeed having more than one installed.

> If changing it is rare, I think my proposal with jmp something; nop; nop; nop
> and changing it at setdispatch time if something changed should be faster.

In normal use it changes very frequently. The GL api is specified as a state machine, and lends itself well to this type of implementation. Historically Mesa didn't have a proper dispatch layer, so we under-used this facility - that is changing however...

> Other things: concerning compsize.c routines, I think they should be
> at least inlined if not killed and replaced by switch () statements
> doing copy by hand (with fallthrough's).

Maybe, but that's on the indirect path and nobody really cares about the performance there. There are better alternatives for efficient remote GL (see www.chromium.org).

Keith
|
From: Jakub J. <ja...@re...> - 2002-05-17 12:43:17
|
On Fri, May 17, 2002 at 01:17:22PM +0100, Keith Whitwell wrote:
> The __indirect stuff is rarely used.

Ok, fine. But if it is rarely used, why isn't it -fpic?

Few more random things which might help the drivers:
1) using fprintf (stderr, ...) all around is a bad idea,
   that means 2 runtime relocations for each when not -fpic, much better
   take it offline into a separate .hidden function which will call
   vfprintf
2) use __builtin_expect if you know better than the compiler what's likely
   and what is unlikely executed

    Jakub |
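Both suggestions in the message above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names LIKELY, UNLIKELY, drv_warn, drv_warning_count and drv_set_bar are hypothetical and do not come from the DRI sources, and the `__GNUC__ >= 4` guard is a conservative guess at where the visibility attribute is available.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical branch-hint macros wrapping __builtin_expect. */
#if defined(__GNUC__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Hypothetical counter, present only so the sketch has observable state. */
static int drv_warning_count;

/*
 * The one place that touches stderr and vfprintf.  Callers reference only
 * this function, and hidden visibility keeps it out of the library's
 * exported symbol table.
 */
#if defined(__GNUC__) && __GNUC__ >= 4
__attribute__((visibility("hidden")))
#endif
int drv_warn(const char *fmt, ...)
{
    va_list ap;
    int n;

    drv_warning_count++;
    va_start(ap, fmt);
    n = vfprintf(stderr, fmt, ap);
    va_end(ap);
    return n;
}

/* Hot path: the error branch is marked as unlikely to be taken. */
int drv_set_bar(int bar)
{
    if (UNLIKELY(bar < 0)) {
        drv_warn("drv_set_bar: bad value %d\n", bar);
        return -1;
    }
    return bar * 2;
}
```

The point of the wrapper is that the stderr reference lives in one function instead of at every call site; how many relocations that saves in a given build would have to be checked with the actual toolchain.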
From: Keith W. <ke...@tu...> - 2002-05-17 13:36:51
|
Jakub Jelinek wrote:
> On Fri, May 17, 2002 at 01:17:22PM +0100, Keith Whitwell wrote:
>
>>The __indirect stuff is rarely used.
>>
>
> Ok, fine. But if it is rarely used, why isn't it -fpic?

The dispatch table in libGL.so is used all the time - this thread was started by Gareth with discussion of an optimization to that code (which also applies to another thread-local variable used by the 'driver.so' backend).

The __indirect stuff is just along for the ride. It's a fallback for network-transparent rendering, broken configs, and a couple of actual useful cases. I don't think it can be split out completely however as we still have to send a few packets off to the X server.

The use of sections you described earlier sounds like we can get that code to be PIC and still have a fast dispatch, which is nice.

It sounds like you haven't looked at the driver backends - this is where most of the code is. Our most up-to-date driver illustrates some of Gareth's concerns and is on the tcl-0-0-branch of DRI CVS in the lib/GL/mesa/src/drv/radeon directory. This does some of the runtime code-generation and dispatch-flipping that he is talking about.

> Few more random things which might help the drivers:
> 1) using fprintf (stderr, ...) all around is a bad idea,
> that means 2 runtime relocations for each when not -fpic, much better
> take it offline into a separate .hidden function which will call
> vfprintf

That's reasonable. In a lot of cases these aren't compiled into release or non-debug drivers, but in the tcl version, I've found it very helpful to keep them in there.

> 2) use __builtin_expect if you know better than the compiler what's likely
> and what is unlikely executed

I'll have to look this one up too.

Keith |
From: Keith W. <ke...@tu...> - 2002-05-17 10:52:15
|
> create either linker script like:
> { global: gl*; DRI*; XF86DRI*; local: *; }
> or
> GL_1.0 { global: gl*; DRI*; XF86DRI*; local: *; }
> then pass this file to linker at libGL.so link time, like
> gcc -shared ... -Wl,--version-script,libGL.map ...
>
> This way you get rid of most R_386_PC32 relocations, etc.
> If libGL.so is compiled with -fpic, you should also in headers prototyping
> the internal functions add __attribute__((visibility("hidden")))
> to the prototypes, so that gcc can avoid setting up pic pointers
> when calling the internal functions. I think most of the functions are
> prototyped using macros, so it wouldn't be much work.

OK, trying the __attribute__ stuff doesn't seem to work for me:

f12.h:3: warning: `visibility' attribute directive ignored
f12.h:4: warning: `visibility' attribute directive ignored

Where f12.h looks like:

---------------------------
#define INTERNAL __attribute__((visibility("hidden")))

extern void foo1( void ) INTERNAL;
extern void foo2( void ) INTERNAL;
extern void glfoo2( void );
---------------------------

And gcc -v gives:

gcc version 2.96 20000731 (Mandrake Linux 8.1 2.96-0.62mdk)

Is this a versions thing or am I screwing up elsewhere?

Keith |
From: Jakub J. <ja...@re...> - 2002-05-17 11:02:10
|
On Fri, May 17, 2002 at 11:50:10AM +0100, Keith Whitwell wrote:
> OK, trying the __attribute__ stuff doesn't seem to work for me:
>
> f12.h:3: warning: `visibility' attribute directive ignored
> f12.h:4: warning: `visibility' attribute directive ignored
>
> Where f12.h looks like:
>
> ---------------------------
> #define INTERNAL __attribute__((visibility("hidden")))
>
> extern void foo1( void ) INTERNAL;
> extern void foo2( void ) INTERNAL;
> extern void glfoo2( void );
> ---------------------------
>
> And gcc -v gives:
>
> gcc version 2.96 20000731 (Mandrake Linux 8.1 2.96-0.62mdk)
>
> Is this a versions thing or am I screwing up elsewhere?

It is a very new thing.
Only GCC 3.2, the Red Hat GCC 3.1 package and gcc-2.96-RH >= 2.96-108
(dunno which Mandrake version incorporates those changes).
If the warning bothers you, then you probably need to add some configury
to check whether gcc supports this and define the macro only
if that is the case; but the warnings are harmless.

Concerning the version script, anonymous version scripts are fairly new too
(that's the { global: ...; local: *; } ) in binutils, unlike the standard
version scripts (that's GL_1.0 { global: ...; local: *; } ).
Solaris ld has accepted both forms for a long time though (but has other
limitations).

    Jakub |
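One way the "define the macro only if that is the case" advice might look in a header is sketched here. It is an assumption-laden illustration, not the check Jakub had in mind: it relies on `__has_attribute` (provided by current GCC and Clang) with a plain version test as a fallback, whereas a configure-time test compile would be more reliable for vendor-patched compilers such as the 2.96-RH packages mentioned above. The function bodies and the `calls` counter are hypothetical, added only so the fragment can be compiled and exercised.

```c
/* Prefer a direct feature test where the compiler offers one. */
#if defined(__has_attribute)
#  if __has_attribute(visibility)
#    define INTERNAL __attribute__((visibility("hidden")))
#  endif
#endif

/* Fallback assumption: any GCC-compatible compiler reporting version 4+. */
#if !defined(INTERNAL) && defined(__GNUC__) && __GNUC__ >= 4
#  define INTERNAL __attribute__((visibility("hidden")))
#endif

/* Unknown or old compiler: expand to nothing, so no warning is emitted. */
#ifndef INTERNAL
#  define INTERNAL
#endif

/* Same prototypes as f12.h in the message above. */
extern void foo1(void) INTERNAL;
extern void foo2(void) INTERNAL;
extern void glfoo2(void);

/* Hypothetical bodies so the sketch can be exercised. */
static int calls;

void foo1(void) { calls += 1; }
void foo2(void) { calls += 10; }

/* Exported entrypoint calling the two internal helpers. */
void glfoo2(void)
{
    foo1();
    foo2();
}
```

With the empty fallback the header still compiles everywhere; the only cost on an unsupported compiler is that the internal functions stay exported unless a version script hides them.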
From: Keith W. <ke...@tu...> - 2002-05-17 11:10:18
|
Jakub Jelinek wrote:
>>And gcc -v gives:
>>
>>gcc version 2.96 20000731 (Mandrake Linux 8.1 2.96-0.62mdk)
>>
>>Is this a versions thing or am I screwing up elsewhere?
>>
>
> It is a very new thing.
> Only GCC 3.2, the Red Hat GCC 3.1 package and gcc-2.96-RH >= 2.96-108
> (dunno which Mandrake version incorporates those changes).
> If the warning bothers you, then you probably need to add some configury
> to check whether gcc supports this and define the macro only
> if that is the case; but the warnings are harmless.

OK. Thanks for the info - it does reduce the utility somewhat, but at least we can do this for pre-built binaries. I don't think too many people have a compliant compiler yet...

Keith |
From: Gareth H. <gar...@ac...> - 2002-05-16 23:49:04
|
Jakub Jelinek wrote:
> Hi!
>
> What percentage of applications use different dispatch
> tables among its threads? How often do dispatch table changes
> occur? If both of these are fairly low, computing a dispatch table
> in an awx section at dispatch table switch time might be fastest

I should also point out that display list compilation and playback is another place where the dispatch table changes (typically, you have at least a dispatch table for regular immediate mode, display list compilation and display list playback).

One of the ugliest things about the Microsoft Windows implementation of OpenGL is that the driver backend must call a function to register a new dispatch table, and the OpenGL library then makes several copies of this table internally. Being able to switch dispatch tables with a single pointer reassignment makes it easy to do very powerful optimizations.

-- Gareth |
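The "single pointer reassignment" described above can be illustrated with a small C sketch. Everything here is hypothetical and simplified: the table has one entry, `int` stands in for GLint, the counters exist only to make the effect observable, and the current-table pointer is a plain global, so this covers the single-threaded case only.

```c
/* One-entry stand-in for the several-hundred-entry dispatch table. */
struct gl_dispatch {
    void (*Foo)(int bar);
};

/* Hypothetical counters showing which implementation ran. */
static int immediate_calls;
static int compiled_calls;

/* Immediate-mode implementation: would act on the value right away. */
static void immediate_Foo(int bar) { (void)bar; immediate_calls++; }

/* Display-list compilation: would record the call for later playback. */
static void compile_Foo(int bar) { (void)bar; compiled_calls++; }

static const struct gl_dispatch immediate_table = { immediate_Foo };
static const struct gl_dispatch compile_table   = { compile_Foo };

/* The active table; a real libGL would keep this per thread. */
static const struct gl_dispatch *current_dispatch = &immediate_table;

/* Top-level entrypoint: load the table pointer, jump through it. */
void glFoo(int bar)
{
    current_dispatch->Foo(bar);
}

/* Switching behaviour for every entrypoint is one pointer store. */
void begin_list(void) { current_dispatch = &compile_table; }
void end_list(void)   { current_dispatch = &immediate_table; }
```

No table is copied when the mode changes, which is the contrast being drawn with the register-and-copy scheme mentioned in the message.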
From: Gareth H. <gar...@ac...> - 2002-05-17 04:36:54
|
A question about the __thread stuff: does it require -fPIC? What happens if you don't compile a library with -fPIC, and have __thread variables declared in that library? -- Gareth |
From: Ulrich D. <dr...@re...> - 2002-05-17 00:32:26
|
On Thu, 2002-05-16 at 15:41, Gareth Hughes wrote:
> I would like to propose a small change to the pthread_descr structure in
> the latest LinuxThreads code, to better support OpenGL on GNU/Linux
> systems (particularly on x86, but not excluding other platforms). The
> purpose of this patch is to provide efficient thread-local storage for
> both libGL itself and loadable OpenGL driver modules,

glibc already supports the thread-local storage extension to ELF and so does binutils. Only gcc support is missing but you can work around this with asm. Once gcc has the support you'll be able to write

    __thread some_type some_var;

as if some_var would be a global variable. In fact it'll be thread-specific.

This is the only way you'll get access to thread-local storage. It is out of question to allow third party program peek and poke into the thread descriptor.

--
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------
|
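For comparison with the accessor functions in the original proposal, here is a rough sketch of how the current-context pointer might be expressed with the `__thread` form described above. It assumes a compiler that accepts `__thread` (GCC and Clang have since gained it), and the names gl_context, make_current, get_current_context and my_Foo are illustrative rather than taken from libGL.

```c
/* Hypothetical, heavily reduced rendering context. */
struct gl_context {
    int bar;
};

/*
 * One pointer per thread.  It is written like an ordinary global, but each
 * thread that touches it sees its own copy, initially NULL.
 */
static __thread struct gl_context *current_context;

/* Accessor corresponding to __get_current_context() in the proposal. */
struct gl_context *get_current_context(void)
{
    return current_context;
}

/* Bind a context to the calling thread (or unbind it by passing NULL). */
void make_current(struct gl_context *gc)
{
    current_context = gc;
}

/* Driver-side entrypoint: stores the value in the caller's own context. */
void my_Foo(int bar)
{
    struct gl_context *gc = get_current_context();

    if (gc)
        gc->bar = bar;
}
```

The instruction sequence the compiler emits for the access depends on the TLS model in use and on whether the code is built as position-independent, so the sketch says nothing about the cost argument made elsewhere in the thread.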
From: Gareth H. <gar...@ac...> - 2002-05-17 00:55:05
|
Ulrich Drepper wrote:
>
> This is the only way you'll get access to thread-local storage. It is
> out of question to allow third party program peek and poke into the
> thread descriptor.

What do you mean, a third party program? We're talking about a system library (libGL.so) here. There is a similar shortcut for libc (p_libc_specific) already in there.

-- Gareth |
From: Ulrich D. <dr...@re...> - 2002-05-17 01:45:44
|
On Thu, 2002-05-16 at 17:54, Gareth Hughes wrote:
> What do you mean, a third party program? We're talking about a system
> library (libGL.so) here.

Everything which is not part of glibc is third-party. It's the same as if some program would require access to internal data structures of libGL. There are several different layouts of the thread descriptor and it's only getting worse. The actual layout doesn't matter since everything is internal to glibc and the other libraries which come with it so this is no problem.

Beside, I don't understand why you react like this. Using __thread is the best you can ever get. It'll be portable (Solaris 9 already has the support as well) and it's faster than anything you can get to access the data.

--
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------
|
From: Gareth H. <gar...@ac...> - 2002-05-17 03:08:27
|
Ulrich Drepper wrote:
>
> Everything which is not part of glibc is third-party. It's the same as
> if some program would require access to internal data structures of
> libGL. There are several different layouts of the thread descriptor and
> it's only getting worse. The actual layout doesn't matter since
> everything is internal to glibc and the other libraries which come with
> it so this is no problem.

I don't understand the reference to the multiple layouts of the thread descriptor structure. Can you explain this?

> Beside, I don't understand why you react like this. Using __thread is
> the best you can ever get. It'll be portable (Solaris 9 already has the
> support as well) and it's faster than anything you can get to access the
> data.

I disagree that __thread is the best you can ever get. In the best case, you have an extra load and subtraction before you have the address of a thread-local variable. In the worst case, you have a function call in there as well. That is:

    movl %gs:0,%eax
    subl $foo@tpoff,%eax
    movl (%eax),%eax
    jmp *1234(%eax)

versus:

    movl %gs:32,%eax
    jmp *1234(%eax)

for instance. When the function you are jumping to consists of five or six instructions, say, an extra two instructions are significant. Recall that a competing operating system on x86 allows access to the context and dispatch pointers with two instructions, so what you are suggesting will mean we always have an inferior implementation.

You also need -fpic, which burns a whole register. This is a non-trivial sacrifice, particularly on x86.

Let's be clear about what I'm proposing: you agree to reserve an 8*sizeof(void *) block at a well-defined and well-known offset in the TCB. OpenGL is free to access that block, but only that block.

-- Gareth |
From: Gareth H. <gar...@ac...> - 2002-05-17 03:12:22
|
Gareth Hughes wrote:
>
> Let's be clear about what I'm proposing: you agree to reserve an
> 8*sizeof(void *) block at a well-defined and well-known offset in the
> TCB.

Of course, I should add that space for such a block exists, and has existed for some time. My proposal requires no real changes on the glibc side of things, other than to set in stone the agreement between pthreads and OpenGL to ensure this block is there in the future.

-- Gareth |