From: Nicholas M. <nm...@gm...> - 2006-02-28 00:51:34
On Mon, 27 Feb 2006 15:25:03 -0800, Ian Romanick wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> After listening to a couple fairly vocal people squawk about the x86-64
> dispatch stubs, I spent some time investigating the raised issues. The
> primary issue is that the TLS versions of the stubs contains an
> unnecessary function call to get the dispatch pointer.
>

I wasn't "squawking", I was complaining that your stated objections to the
patch were based on erroneous facts. I would've been perfectly happy with
"the advertised performance benefit isn't worth the effort involved"
(although there wasn't all that much effort).

>
> The results are not impressive. The libGL.so with the modified dispatch
> routines is 13KiB larger.

That's odd. The dispatch routines are 16-byte aligned and the inlining
doesn't grow the size of the routine above 16 bytes. Did actual .text size
change, or just the library on-disk size?

> The measured API overhead was, at best, 1 clock
> cycle faster. In most cases the measured overhead was much, much less
> than the resolution of the measurement apparatus (e.g., glFogCoordfEXT
> scored 71.284420 for the original vs. 71.280840 for the modified).
>
> Given these results, I'm inclined to leave the x86-64 assembly dispatch
> stubs as they are. Evidence showing either a benchmark where the
> modified dispatch stubs are faster or showing some flaw in my testing
> methodology would, naturally, give me reason to revisit this issue. In
> the mean time, I am considering it closed.
>

Does the benchmark test the effects of the return address stack
overflowing? I don't know how deep call chains are typically in
high-performance GL applications, but that extra entry on the function call
stack might cause mispredictions on return. (Of course, if the call depth
below the dispatch routine already exceeds the size of the RAS, this is
irrelevant.)
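
To make the "extra entry" concrete, here is a rough C sketch of the two stub
styles being compared. It is not Mesa's code (the real stubs are assembly),
and every name in it (dispatch_table, tls_dispatch, get_dispatch, the stub_*
functions) is made up for illustration. The only point is that the first
style adds one more call/return pair in front of the real dispatch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-slot dispatch table; a real one has hundreds of slots. */
typedef void (*fog_coord_fn)(float);
struct dispatch_table {
    fog_coord_fn FogCoordf;
};

/* Per-thread pointer to the current dispatch table (C11 thread-local). */
static _Thread_local struct dispatch_table *tls_dispatch = NULL;

/* Records the last argument so the demo has an observable effect. */
static float last_fog_coord = 0.0f;

/* Stand-in for the driver's implementation of the entry point. */
static void driver_FogCoordf(float coord)
{
    last_fog_coord = coord;
}

/* Style A helper: the dispatch pointer is fetched through a function.
 * In an assembly stub this is a real call/ret pair; a C compiler may
 * well inline it, so treat this only as a picture of the control flow. */
static struct dispatch_table *get_dispatch(void)
{
    return tls_dispatch;
}

/* Style A: call the helper, then dispatch through the table. */
static void stub_call_FogCoordf(float coord)
{
    struct dispatch_table *tbl = get_dispatch();
    assert(tbl != NULL);
    tbl->FogCoordf(coord);
}

/* Style B: read the thread-local pointer directly, then dispatch. */
static void stub_inline_FogCoordf(float coord)
{
    struct dispatch_table *tbl = tls_dispatch;
    assert(tbl != NULL);
    tbl->FogCoordf(coord);
}

static struct dispatch_table demo_table = { driver_FogCoordf };

/* Runs both stubs; returns 1 if each reached the same implementation. */
static int demo_both_stubs_agree(void)
{
    float a, b;

    tls_dispatch = &demo_table;
    stub_call_FogCoordf(1.5f);
    a = last_fog_coord;
    stub_inline_FogCoordf(2.5f);
    b = last_fog_coord;
    return a == 1.5f && b == 2.5f;
}
```

Both styles end up in the same function; they differ only in how many
call/return pairs happen on the way there, which is what matters for the RAS
question.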
> If someone is really excited about improving the state of things on
> x86-64, they might choose to investigate adding code to dynamically
> generate dispatch functions for newly registered (by a DRI driver at
> run-time) extension functions. This is currently done for x86, SPARC,
> and Alpha, but not for x86-64, PowerPC, or IA-64.
>

How do dynamically generated dispatch functions improve performance? Are
the routines different depending on whether or not the app is threaded? For
that matter, why does Mesa have its own reimplementation of dlsym() (or the
equivalent for your platform of choice)?

(Also, there doesn't seem to be anything Alpha-related in Mesa.)