[Dri-devel] OpenGL and the LinuxThreads pthread_descr structure

I would like to propose a small change to the pthread_descr structure in
the latest LinuxThreads code, to better support OpenGL on GNU/Linux
systems (particularly on x86, but not excluding other platforms).  The
purpose of this patch is to provide efficient thread-local storage for
both libGL itself and loadable OpenGL driver modules, so that they can
be made thread-safe without any impact on performance.  Indeed, using
this mechanism, an OpenGL driver can ignore the difference between
running with a single thread and running with multiple threads, as
"global" data will be accessed in the same way independent of the number
of threads running.

To understand the need for such a change, one should consider what goes
on inside an OpenGL implementation when an application makes an OpenGL
API call.  One of the primary tasks of the driver-independent libGL is
to dispatch function calls to the driver backend(s), usually through a
large function pointer table containing entries for the several hundred
API entrypoints.  Central to this process is the notion of a rendering
context, an abstraction of the OpenGL state machine.  A current context
is required to execute OpenGL commands.  The GLX specification states:

     Each thread can have at most one current rendering context. In
     addition, a rendering context can be current for only one thread
     at a time.

The dispatch table for a context depends on the current state of OpenGL
for that context, as things like display list compilation, display list
playback and plain old immediate mode rendering change the behaviour of
many API entrypoints.  We see from the quote above that each thread has,
at most, a single context, and this context has a single current
dispatch table.

The top-level API entrypoints can be implemented like the following:

     struct gl_dispatch {
       ...
       void (*Foo)(GLint bar);
       ...
     };

     void glFoo(GLint bar)
     {
       struct gl_dispatch *current = __get_current_dispatch();
       current->Foo(bar);
     }

Similarly, a driver's implementation of the above entrypoint might look
like the following:

     void __my_Foo(GLint bar)
     {
       struct gl_context *gc = __get_current_context();

       /* remember the current setting of bar */
       gc->state.current.bar = bar;

       /* do stuff with bar, like program hardware registers */
       ...
     }
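
To make the point about state-dependent dispatch tables concrete, here
is a minimal single-threaded sketch.  The one-slot table, the exec_/save_
functions and the demo_ helpers are hypothetical names invented for
illustration, and a plain static pointer stands in for the per-thread
dispatch pointer discussed in this mail:

```c
#include <assert.h>
#include <stddef.h>

typedef int GLint;                 /* stand-in for the real GL typedef */

/* Hypothetical one-slot table; a real one has several hundred entries. */
struct gl_dispatch {
    void (*Foo)(GLint bar);
};

static GLint  executed;            /* last value handled in immediate mode */
static GLint  recorded[16];        /* values captured while compiling a list */
static size_t n_recorded;

static void exec_Foo(GLint bar) { executed = bar; }

static void save_Foo(GLint bar)
{
    assert(n_recorded < sizeof recorded / sizeof recorded[0]);
    recorded[n_recorded++] = bar;
}

static const struct gl_dispatch exec_table = { exec_Foo };
static const struct gl_dispatch save_table = { save_Foo };

/* Single-threaded stand-in for the per-thread dispatch pointer. */
static const struct gl_dispatch *current_dispatch = &exec_table;

void glFoo(GLint bar) { current_dispatch->Foo(bar); }

/* Hypothetical mode switches: one pointer store retargets every entrypoint. */
void demo_begin_list(void) { current_dispatch = &save_table; }
void demo_end_list(void)   { current_dispatch = &exec_table; }

/* Returns 1 if calls were routed according to the current mode. */
int demo_dispatch_modes(void)
{
    n_recorded = 0;
    glFoo(1);                      /* immediate mode: executed at once */
    demo_begin_list();
    glFoo(2);                      /* compile mode: recorded, not executed */
    demo_end_list();
    glFoo(3);                      /* immediate mode again */
    return executed == 3 && n_recorded == 1 && recorded[0] == 2;
}
```

The entrypoint itself never tests which mode is active; changing mode is
a single pointer store, which is why the lookup of that pointer is the
part worth optimising.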

We want __get_current_context() and __get_current_dispatch() (at a
minimum) to be as efficient as possible, while still providing thread
safety.  Suppose we add a libGL-specific area to pthread_descr.  This
would allow us to implement these (and other similar) functions like so:

     void *__get_current_context(void)
     {
       pthread_descr self = thread_self();
       return THREAD_GETMEM(self,
                            p_libGL_specific[_LIBGL_TSD_KEY_CONTEXT]);
     }

     void *__get_current_dispatch(void)
     {
       pthread_descr self = thread_self();
       return THREAD_GETMEM(self,
                            p_libGL_specific[_LIBGL_TSD_KEY_DISPATCH]);
     }
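
For comparison, the portable way to get the same per-thread lookup today
is POSIX thread-specific data.  The sketch below uses only the standard
pthread_key_create(), pthread_setspecific() and pthread_getspecific()
calls; the demo_ names and the dummy context objects are hypothetical.
It is not part of the proposed patch, only an illustration of the
behaviour we need, at the price of a library call per lookup:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_key_t  context_key;              /* hypothetical TSD key */
static pthread_once_t context_once = PTHREAD_ONCE_INIT;

static void make_context_key(void)
{
    int rc = pthread_key_create(&context_key, NULL);
    assert(rc == 0);
    (void)rc;
}

void demo_set_current_context(void *gc)
{
    pthread_once(&context_once, make_context_key);
    pthread_setspecific(context_key, gc);
}

void *demo_get_current_context(void)
{
    pthread_once(&context_once, make_context_key);
    return pthread_getspecific(context_key);
}

/* Each worker installs its own context and reports what it reads back. */
static void *worker(void *own_context)
{
    demo_set_current_context(own_context);
    return demo_get_current_context();
}

/* Returns 1 if every thread saw only the context it installed itself. */
int demo_tsd_isolated(void)
{
    int main_ctx, ctx_a, ctx_b;                 /* dummy context objects */
    pthread_t ta, tb;
    void *seen_a = NULL, *seen_b = NULL;
    int ok;

    demo_set_current_context(&main_ctx);
    if (pthread_create(&ta, NULL, worker, &ctx_a) != 0)
        return 0;
    if (pthread_create(&tb, NULL, worker, &ctx_b) != 0) {
        pthread_join(ta, NULL);
        return 0;
    }
    pthread_join(ta, &seen_a);
    pthread_join(tb, &seen_b);

    ok = seen_a == (void *)&ctx_a && seen_b == (void *)&ctx_b
      && demo_get_current_context() == (void *)&main_ctx;
    demo_set_current_context(NULL);             /* drop the stack pointer */
    return ok;
}
```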

This would allow us to hand-code the top-level dispatch functions on x86
as:

     glFoo:
         movl %gs:__gl_dispatch_offset, %eax
         jmp *__glapi_Foo(%eax)

where __gl_dispatch_offset is the byte offset of the thread-local
dispatch table pointer within pthread_descr and __glapi_Foo is the byte
offset of the Foo entry in the dispatch table.  This is an efficient
implementation of the dispatch mechanism required by OpenGL, and is
completely thread-safe to boot.
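
For readers who prefer C to assembly, the second instruction can be
sketched as follows.  GLAPI_FOO_OFFSET plays the role of __glapi_Foo;
the two-slot table and the demo_/my_ names are hypothetical, and C can
only express the final step as a call where the assembly uses a tail
jump:

```c
#include <assert.h>
#include <stddef.h>

typedef int GLint;                 /* stand-in for the real GL typedef */

/* Hypothetical two-slot table so that Foo sits at a non-zero offset. */
struct gl_dispatch {
    void (*Begin)(GLint mode);
    void (*Foo)(GLint bar);
};

/* Byte offset of the Foo slot: the C analogue of __glapi_Foo. */
enum { GLAPI_FOO_OFFSET = offsetof(struct gl_dispatch, Foo) };

typedef void (*foo_fn)(GLint);

static GLint seen;                 /* records what the backend received */
static void my_Begin(GLint mode) { (void)mode; }
static void my_Foo(GLint bar)    { seen = bar; }

/* Load the function pointer stored at table + offset, then call it. */
void demo_call_by_offset(const struct gl_dispatch *table, GLint bar)
{
    const char *base = (const char *)table;
    foo_fn fn = *(const foo_fn *)(base + GLAPI_FOO_OFFSET);

    assert(fn != NULL);
    fn(bar);
}

/* Returns the value that reached the backend through the offset lookup. */
int demo_offset_dispatch(void)
{
    struct gl_dispatch table = { my_Begin, my_Foo };

    seen = 0;
    demo_call_by_offset(&table, 42);
    return seen;
}
```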

With modern OpenGL applications and benchmarks dealing with datasets
containing over 1 million vertices, with one or more function calls per
vertex, you can see that an efficient dispatching mechanism is crucial
for a high-performance OpenGL implementation.  For example, the SPEC
Viewperf benchmark's Light test (as described at
http://www.spec.org/gpc/opc.static/light05.html) includes a subtest that
renders over half a million wireframe primitives like so:

     GLfloat color[][4];
     GLfloat position[][4];

     glBegin(GL_LINE_LOOP);
       glColor3fv(color[i]);
       glVertex3fv(position[i]);
       glColor3fv(color[i+1]);
       glVertex3fv(position[i+1]);
       glColor3fv(color[i+2]);
       glVertex3fv(position[i+2]);
       glColor3fv(color[i+3]);
       glVertex3fv(position[i+3]);
     glEnd();

With 10 function calls per primitive (glBegin, four glColor3fv and
glVertex3fv pairs, and glEnd), this equates to over 5 million function
calls per frame.  This is a worst-case scenario, and there are more
efficient methods of rendering such large amounts of data, but Viewperf
(the industry-standard OpenGL benchmark) deliberately stresses this path
to measure the cost of API calls, as many workstation OpenGL apps
(engineering, CAD and 3D modelling tools) still operate like this.
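
The arithmetic behind that figure, as a small sketch.  The constant and
function names are hypothetical, and 500000 is simply the "half a
million" lower bound quoted above, not a measured primitive count:

```c
#include <assert.h>

/* Calls issued for one primitive in the Viewperf loop shown earlier:
   glBegin + 4 * (glColor3fv + glVertex3fv) + glEnd. */
enum {
    VERTS_PER_PRIM = 4,
    CALLS_PER_VERT = 2,
    CALLS_PER_PRIM = 1 + VERTS_PER_PRIM * CALLS_PER_VERT + 1
};

/* Total API calls needed to draw the given number of primitives. */
long demo_calls_per_frame(long primitives)
{
    assert(primitives >= 0);
    return primitives * CALLS_PER_PRIM;
}
```

Calling demo_calls_per_frame(500000) gives 5000000, and every one of
those calls passes through the dispatch sequence.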

An important point to understand is that the round trip through the API,
into the driver and back out again for this immediate mode path can
often be counted in tens of instructions.  State of the art OpenGL
implementations often do runtime code generation to implement these
paths, resulting in a very lightweight driver backend for this part of
the API.  Clearly, the dispatching mechanism becomes a significant
percentage of the total number of cycles here, and thus we want it to be
as efficient as possible.
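
A rough way to see the proportion, using made-up numbers.  The
instruction counts passed in are assumptions chosen for illustration,
not measurements of any driver, and instruction counts only approximate
cycle counts:

```c
#include <assert.h>

/* Share of one API call's instructions spent on dispatch, in whole
   percent (integer division rounds down). */
int demo_dispatch_share(int dispatch_insns, int driver_insns)
{
    assert(dispatch_insns >= 0 && driver_insns >= 0);
    assert(dispatch_insns + driver_insns > 0);
    return 100 * dispatch_insns / (dispatch_insns + driver_insns);
}
```

If the driver side of a call took a hypothetical 20 instructions, a
two-instruction dispatch would be about 9% of the total, while a
hypothetical ten-instruction dispatch would be about 33%.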

It is worth noting that the Microsoft Windows implementation of OpenGL
for x86 has dedicated space in its per-thread data structures for things
like the current context and dispatch table, so anything more than the
above two instructions for the top-level entrypoint will put GNU/Linux
at a disadvantage compared with that platform.

For ease of implementation and delivery of a libGL that makes use of
these features, I propose we add the libGL-specific thread local storage
area in the space reserved for the p_header field at the start of the
structure, like so:

     enum __libGL_tsd_key_t {
       _LIBGL_TSD_KEY_CONTEXT = 0,
       _LIBGL_TSD_KEY_DISPATCH,
       /* leave room for vendor-specific data */
       _LIBGL_TSD_KEY_N = 8
     };

     struct _pthread_descr_struct {
       union {
         struct {
           void *tcb;
           union dtv *dtvp;
           pthread_descr self;
         } data;
-       void *__padding[16];
+       void *__padding[8];
       } p_header;
+     void *p_libGL_specific[_LIBGL_TSD_KEY_N];

       pthread_descr p_nextlive, p_prevlive;
       pthread_descr p_nextwaiting;
       pthread_descr p_nextlock;
       ...
     };

This allows us to provide this functionality in libGL on all glibc-2.2
systems.  glibc-2.1 can also be supported on x86 (given a functional
kernel), as these versions did not use the segment registers for
pthread_descr access.  Given the importance of "free" thread local
storage to an OpenGL implementation, I believe this change is warranted,
even though it may not be the one chosen if the pthread_descr structure
were being designed from scratch.
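
The binary compatibility of the change rests on the layout: halving the
padding and adding eight pointers leaves every later field at its old
offset.  A compile-time sketch of that check follows, using simplified
stand-in structures (void * for the descriptor and dtv pointer types,
only one trailing field, hypothetical names) rather than the real
LinuxThreads definitions:

```c
#include <assert.h>
#include <stddef.h>

enum { LIBGL_TSD_KEY_N = 8 };      /* mirrors _LIBGL_TSD_KEY_N */

/* Simplified stand-in for the current layout. */
struct old_descr {
    union {
        struct { void *tcb; void *dtvp; void *self; } data;
        void *padding[16];
    } p_header;
    void *p_nextlive;
};

/* Simplified stand-in for the proposed layout. */
struct new_descr {
    union {
        struct { void *tcb; void *dtvp; void *self; } data;
        void *padding[8];
    } p_header;
    void *p_libGL_specific[LIBGL_TSD_KEY_N];
    void *p_nextlive;
};

/* The libGL area starts where the second half of the padding used to be. */
_Static_assert(offsetof(struct new_descr, p_libGL_specific)
               == 8 * sizeof(void *), "libGL area not at padding[8]");

/* Fields after the header must not move. */
_Static_assert(offsetof(struct old_descr, p_nextlive)
               == offsetof(struct new_descr, p_nextlive),
               "later fields shifted");

/* Runtime form of the same check; returns 1 when the layouts agree. */
int demo_layout_unchanged(void)
{
    return sizeof(struct old_descr) == sizeof(struct new_descr)
        && offsetof(struct old_descr, p_nextlive)
           == offsetof(struct new_descr, p_nextlive);
}
```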

The recent advancements in glibc's support for automatic thread-local
storage using the __thread keyword (as described in Ulrich Drepper's
To-Do list for glibc-2.3 under 'Implement TLS in dynamic linker',
http://people.redhat.com/drepper/todo-2.3.html) are encouraging, and
some may suggest that using this support would be more appropriate.
However, the additional pointer dereferences to access these new
thread-local storage areas will always be slower than the direct access
scheme proposed above.  Also, it will be some time before this system
is widely available, whereas the changes proposed above are binary
compatible with all glibc-2.2 releases.
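
For reference, this is roughly what the __thread approach looks like
from the application side, assuming a toolchain and C library that
support the keyword.  The variable and function names are hypothetical.
The sketch only shows the semantics; how many instructions each access
costs depends on the TLS model the compiler and linker select, which is
not measured here:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One slot per thread, allocated and addressed by the toolchain. */
static __thread void *current_context;

static void *tls_worker(void *own_context)
{
    assert(current_context == NULL);   /* each thread starts zeroed */
    current_context = own_context;
    return current_context;
}

/* Returns 1 if the worker's store did not disturb the caller's slot. */
int demo_thread_keyword(void)
{
    int main_ctx, worker_ctx;          /* dummy context objects */
    pthread_t t;
    void *seen = NULL;
    int ok;

    current_context = &main_ctx;
    if (pthread_create(&t, NULL, tls_worker, &worker_ctx) != 0) {
        current_context = NULL;
        return 0;
    }
    pthread_join(t, &seen);

    ok = seen == (void *)&worker_ctx
      && current_context == (void *)&main_ctx;
    current_context = NULL;            /* drop the stack pointer */
    return ok;
}
```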

-- Gareth

PS - I am not subscribed to libc-alpha, so please CC me on any replies.