From: Gareth H. <gar...@ac...> - 2002-05-16 22:42:08
I would like to propose a small change to the pthread_descr structure in the latest LinuxThreads code, to better support OpenGL on GNU/Linux systems (particularly on x86, but not excluding other platforms). The purpose of this patch is to provide efficient thread-local storage for both libGL itself and loadable OpenGL driver modules, so that they can be made thread-safe without any impact on performance. Indeed, using this mechanism, an OpenGL driver can ignore the difference between running with a single thread and running with multiple threads, as "global" data will be accessed in the same way independent of the number of threads running.

To understand the need for such a change, one should consider what goes on inside an OpenGL implementation when an application makes an OpenGL API call. One of the primary tasks of the driver-independent libGL is to dispatch function calls to the driver backend(s), usually through a large function pointer table containing entries for the several hundred API entrypoints. Central to this process is the notion of a rendering context, or an abstraction of the OpenGL state machine. A context is required to perform OpenGL commands. The GLX specification states:

    Each thread can have at most one current rendering context. In
    addition, a rendering context can be current for only one thread
    at a time.

The dispatch table for a context depends on the current state of OpenGL for that context, as things like display list compilation, display list playback and plain old immediate mode rendering change the behaviour of many API entrypoints. We see from the quote above that each thread has, at most, a single context, and this context has a single current dispatch table. The top-level API entrypoints can be implemented like the following:

    struct gl_dispatch {
        ...
        void (*Foo)(GLint bar);
        ...
    };

    void glFoo(GLint bar)
    {
        struct gl_dispatch *current = __get_current_dispatch();
        current->Foo(bar);
    }

Similarly, a driver's implementation of the above entrypoint might look like the following:

    void __my_Foo(GLint bar)
    {
        struct gl_context *gc = __get_current_context();

        /* remember the current setting of bar */
        gc->state.current.bar = bar;

        /* do stuff with bar, like program hardware registers */
        ...
    }

We want __get_current_context() and __get_current_dispatch() (at a minimum) to be as efficient as possible, while still providing thread safety. Suppose we add a libGL-specific area to pthread_descr. This would allow us to implement these (and other similar) functions like so:

    void *__get_current_context(void)
    {
        pthread_descr self = thread_self();
        return THREAD_GETMEM(self, p_libGL_specific[_LIBGL_TSD_KEY_CONTEXT]);
    }

    void *__get_current_dispatch(void)
    {
        pthread_descr self = thread_self();
        return THREAD_GETMEM(self, p_libGL_specific[_LIBGL_TSD_KEY_DISPATCH]);
    }

This would allow us to hand-code the top-level dispatch functions on x86 as:

    glFoo:
        movl %gs:__gl_context_offset, %eax
        jmp *__glapi_Foo(%eax)

where __gl_context_offset is the byte offset of the thread-local context pointer and __glapi_Foo is the byte offset of the Foo entry in the dispatch table. Clearly this is an efficient implementation of the dispatch mechanism required by OpenGL, and is completely thread-safe to boot.

With modern OpenGL applications and benchmarks dealing with datasets containing over 1 million vertices, with one or more function calls per vertex, you can see that an efficient dispatching mechanism is crucial for a high-performance OpenGL implementation.
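
The dispatch pattern described above can be sketched in portable C. This is only an illustration under stated assumptions: the proposed p_libGL_specific slots do not exist in a stock pthread_descr, so a C11 _Thread_local array stands in for them here, and the names make_current, my_Foo, get_current_context and get_current_dispatch, as well as the reduced gl_context and gl_dispatch structures, are hypothetical stand-ins rather than real libGL symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the GL integer type. */
typedef int GLint;

/* Reduced dispatch table: a real one has several hundred entries. */
struct gl_dispatch {
    void (*Foo)(GLint bar);
};

/* Reduced context: only remembers the last value passed to glFoo(). */
struct gl_context {
    GLint bar;
};

/* Slot keys mirroring the layout proposed in the text. */
enum { TSD_KEY_CONTEXT = 0, TSD_KEY_DISPATCH = 1, TSD_KEY_N = 8 };

/* ASSUMPTION: a C11 thread-local array stands in for p_libGL_specific. */
static _Thread_local void *tsd_slots[TSD_KEY_N];

void *get_current_context(void)
{
    return tsd_slots[TSD_KEY_CONTEXT];
}

void *get_current_dispatch(void)
{
    return tsd_slots[TSD_KEY_DISPATCH];
}

/* Hypothetical helper: bind a context and its table to the calling thread. */
void make_current(struct gl_context *gc, struct gl_dispatch *table)
{
    tsd_slots[TSD_KEY_CONTEXT] = gc;
    tsd_slots[TSD_KEY_DISPATCH] = table;
}

/* Driver backend: records the value in the calling thread's context. */
void my_Foo(GLint bar)
{
    struct gl_context *gc = get_current_context();
    assert(gc != NULL);
    gc->bar = bar;
}

/* Top-level entrypoint: one slot load, then an indirect call. */
void glFoo(GLint bar)
{
    struct gl_dispatch *current = get_current_dispatch();
    assert(current != NULL);
    current->Foo(bar);
}
```

A caller would fill a struct gl_dispatch with my_Foo, call make_current() once on each thread, and then call glFoo() as usual; each thread sees only its own context. The sketch shows the shape of the lookup, not its cost: the compiler decides how the _Thread_local access is generated, which is a different mechanism from the fixed %gs offset proposed above.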
To illustrate the scale, the SPEC Viewperf benchmark's Light test (as described at http://www.spec.org/gpc/opc.static/light05.html) includes a subtest that renders over half a million wireframe primitives like so:

    GLfloat color[][4];
    GLfloat position[][4];

    glBegin(GL_LINE_LOOP);
    glColor3fv(color[i]);
    glVertex3fv(position[i]);
    glColor3fv(color[i+1]);
    glVertex3fv(position[i+1]);
    glColor3fv(color[i+2]);
    glVertex3fv(position[i+2]);
    glColor3fv(color[i+3]);
    glVertex3fv(position[i+3]);
    glEnd();

With 10 function calls per primitive, this equates to over 5 million function calls per frame. This is certainly a worst-case scenario, and there are more efficient methods of rendering such large amounts of data, but Viewperf (the industry-standard OpenGL benchmark) deliberately stresses this path to measure the cost of API calls, as many workstation OpenGL apps (engineering, CAD and 3D modelling tools) still operate like this.

An important point to understand is that the round trip through the API, into the driver and back out again for this immediate mode path can often be counted in tens of instructions. State-of-the-art OpenGL implementations often do runtime code generation to implement these paths, resulting in a very lightweight driver backend for this part of the API. Clearly, the dispatching mechanism becomes a significant percentage of the total number of cycles here, and thus we want it to be as efficient as possible.

It is worth noting that the Microsoft Windows implementation of OpenGL for x86 has dedicated space in its per-thread data structures for things like the current context and dispatch table, so anything more than the above two instructions for the top-level entrypoint will put GNU/Linux at a disadvantage compared with that platform.
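
The call count quoted above for the Viewperf subtest can be checked with a few lines of C. This is a minimal sketch: the function name calls_per_frame is hypothetical, and the figure of 500,000 primitives is a round lower bound taken from the "over half a million" wording, not an exact count from the benchmark.

```c
#include <assert.h>
#include <limits.h>

/*
 * Number of API calls needed to draw `primitives` primitives in immediate
 * mode, where every vertex costs one glColor3fv and one glVertex3fv call
 * and each primitive is bracketed by glBegin/glEnd.
 */
unsigned long calls_per_frame(unsigned long primitives,
                              unsigned long vertices_per_primitive)
{
    unsigned long per_primitive = 2 + 2 * vertices_per_primitive;

    /* Guard against overflow of the final product. */
    assert(primitives <= ULONG_MAX / per_primitive);
    return primitives * per_primitive;
}
```

With four vertices per line loop, per_primitive is 2 + 2 * 4 = 10, so calls_per_frame(500000, 4) evaluates to 5,000,000, matching the figure in the text.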
For ease of implementation and delivery of a libGL that makes use of these features, I propose we add the libGL-specific thread local storage area in the space reserved for the p_header field at the start of the structure, like so:

    enum __libGL_tsd_key_t {
        _LIBGL_TSD_KEY_CONTEXT = 0,
        _LIBGL_TSD_KEY_DISPATCH,
        /* leave room for vendor-specific data */
        _LIBGL_TSD_KEY_N = 8
    };

     struct _pthread_descr_struct {
       union {
         struct {
           void *tcb;
           union dtv *dtvp;
           pthread_descr self;
         } data;
    -    void *__padding[16];
    +    void *__padding[8];
       } p_header;
    +  void *p_libGL_specific[_LIBGL_TSD_KEY_N];
       pthread_descr p_nextlive, p_prevlive;
       pthread_descr p_nextwaiting;
       pthread_descr p_nextlock;
       ...
     };

This allows us to provide this functionality in libGL on all glibc-2.2 systems. glibc-2.1 can also be supported on x86 (given a functional kernel), as these versions did not use the segment registers for pthread_descr access.

Given the importance of "free" thread local storage to an OpenGL implementation, I believe this change is warranted, even though it may not be the one chosen if the pthread_descr structure were being designed from scratch.

The recent advancements in glibc's support for automatic thread-local storage using the __thread keyword (as described in Ulrich Drepper's To-Do list for glibc-2.3 under 'Implement TLS in dynamic linker', http://people.redhat.com/drepper/todo-2.3.html) are encouraging, and some may suggest that using this support would be more appropriate. However, the additional pointer dereferences to access these new thread-local storage areas will always be slower than the direct access scheme proposed above. Also, it will be some time before this system is widely available, whereas the changes proposed above are binary compatible with all glibc-2.2 releases.

-- Gareth

PS - I am not subscribed to libc-alpha, so please CC me on any replies.
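
As a supplementary check on the binary-compatibility claim in the proposal above, the following C sketch compares a reduced model of the structure before and after the change. It rests on several assumptions: descr_old and descr_new are hypothetical models that keep only the p_header union and the first following field, every member is simplified to void *, and libgl_slot_offset is an illustrative helper, not a glibc or libGL function.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the current layout: 16 pointers of padding in p_header. */
struct descr_old {
    union {
        struct { void *tcb; void *dtvp; void *self; } data;
        void *padding[16];
    } p_header;
    void *p_nextlive;   /* first field after the header */
};

/* Model of the proposed layout: 8 pointers of padding plus 8 libGL slots. */
struct descr_new {
    union {
        struct { void *tcb; void *dtvp; void *self; } data;
        void *padding[8];
    } p_header;
    void *p_libGL_specific[8];
    void *p_nextlive;
};

/* Fields after the header must not move, or existing binaries would break. */
static_assert(offsetof(struct descr_old, p_nextlive) ==
              offsetof(struct descr_new, p_nextlive),
              "fields following p_header keep their offsets");
static_assert(sizeof(struct descr_old) == sizeof(struct descr_new),
              "overall size is unchanged");

/* Byte offset of a libGL slot from the start of the descriptor. */
size_t libgl_slot_offset(unsigned key)
{
    return offsetof(struct descr_new, p_libGL_specific) + key * sizeof(void *);
}
```

In this model the slots begin eight pointers into the descriptor, so with 4-byte pointers slot 0 sits at byte offset 32 and slot 1 at byte offset 36. Whether those values can be used directly as segment-relative offsets depends on the thread register pointing at the start of the descriptor, which is assumed here rather than verified.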