[Mesa3d-dev] Texutil optimizations - an interesting story

I've got a bunch of nice MMX code that I wanted to plug into the texutil
framework, so all the hardware drivers could benefit from fast texture
image conversions.  I set about creating a function table of specialized
versions of the basic texture image conversion code.  At the most basic
level, this table would be filled with functions made by splitting
_mesa_convert_teximage() into a whole bunch of smaller routines.  All of
this applies to the subimage conversion code too, but we'll use the main
one as an example.

This seemed like a great idea - do some basic tests, build up a function
table index, and then call the specialized conversion routine using this
index.  We could then make sure the Q3/UT/<insert favourite app> texture
conversions were super-fast.
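
A minimal sketch of the function table idea, for illustration only.  The
names convert_func, convert_tab and convert_texels are hypothetical and
are not the actual texutil interface, and the sketch assumes tightly
packed source texels stored as R, G, B, A bytes.  It shows the index
being used to pick a specialized routine:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature: convert `count` RGBA8888 texels to 16 bits. */
typedef void (*convert_func)(const unsigned char *src,
                             unsigned short *dst, size_t count);

/* Hypothetical table indices, one per destination format. */
enum { CONVERT_ARGB4444, CONVERT_RGB565, NUM_CONVERT_FUNCS };

/* Keep the top 4 bits of each channel, packed as A:R:G:B. */
static void convert_argb4444(const unsigned char *src,
                             unsigned short *dst, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++, src += 4) {
        dst[i] = (unsigned short)(((src[3] & 0xf0) << 8) |
                                  ((src[0] & 0xf0) << 4) |
                                   (src[1] & 0xf0) |
                                   (src[2] >> 4));
    }
}

/* Keep 5/6/5 bits of R/G/B and drop alpha. */
static void convert_rgb565(const unsigned char *src,
                           unsigned short *dst, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++, src += 4) {
        dst[i] = (unsigned short)(((src[0] & 0xf8) << 8) |
                                  ((src[1] & 0xfc) << 3) |
                                   (src[2] >> 3));
    }
}

/* The function table, indexed by the enum above. */
static const convert_func convert_tab[NUM_CONVERT_FUNCS] = {
    convert_argb4444,
    convert_rgb565
};

/* Dispatch through the table using a precomputed index. */
void convert_texels(int index, const unsigned char *src,
                    unsigned short *dst, size_t count)
{
    assert(index >= 0 && index < NUM_CONVERT_FUNCS);
    convert_tab[index](src, dst, count);
}
```

An MMX version of a routine would simply replace the matching table
entry, leaving the callers untouched.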

Anyway, I ripped the _mesa_convert_teximage() code apart and built up a
whole bunch of separate functions, plugging these into the function
table.  I then ran a basic benchmark - converting a 32bpp RGBA texture
to ARGB4444 format, which is often used in 16bpp rendering.  For
comparison, I also timed the original _mesa_convert_teximage() code and
mgaConvertTexture(), as I'd noticed parts of the MGA code were
significantly faster than the original Mesa code.

For this particular test, the MGA and Mesa code were about even, with
the MGA ahead by perhaps a percent or two.  To my great surprise, the
ripped-apart function was 25% slower.  I removed the function table and
called the _mesa_convert_teximage_argb_4444 function directly, but it
was still 25% slower.  I copied the *exact* code that was being executed
from the original Mesa code and plugged it into the separate function
again.  Still no good.  I removed the _mesa_image_address and
_mesa_image_row_stride calls (which is kinda the whole point of breaking
it apart like this) and made the routine inline, and the best I could
get it was around 10-15% slower.  At this point, I was confused.

Time to look at the compiler output.

Sure enough, the exact same block of C code was producing vastly
different assembly code.  Here's the code from the original Mesa
routine, which I expected to be slower:

    1008:	8b 55 b4             	mov    %edx,DWORD PTR [%ebp-76]
    100b:	8b 7d a4             	mov    %edi,DWORD PTR [%ebp-92]
    100e:	8a 04 17             	mov    %al,BYTE PTR [%edi+%edx]
    1011:	24 f8                	and    %al,0xf8
    1013:	88 85 f8 fe ff ff    	mov    BYTE PTR [%ebp-264],%al
    1019:	66 c1 e0 08          	shl    %ax,0x8
    101d:	66 89 85 f4 fe ff ff 	mov    DWORD PTR [%ebp-268],%ax
    1024:	8a 16                	mov    %dl,BYTE PTR [%esi]
    1026:	80 e2 fc             	and    %dl,0xfc
    1029:	66 0f b6 c2          	movzx  %ax,%dl
    102d:	66 c1 e0 03          	shl    %ax,0x3
    1031:	66 09 85 f4 fe ff ff 	or     DWORD PTR [%ebp-268],%ax
    1038:	8b bd f0 fe ff ff    	mov    %edi,DWORD PTR [%ebp-272]
    103e:	8a 17                	mov    %dl,BYTE PTR [%edi]
    1040:	c0 ea 03             	shr    %dl,0x3
    1043:	66 0f b6 c2          	movzx  %ax,%dl
    1047:	8b 95 f4 fe ff ff    	mov    %edx,DWORD PTR [%ebp-268]
    104d:	09 c2                	or     %edx,%eax
    104f:	66 89 11             	mov    DWORD PTR [%ecx],%dx
    1052:	83 c1 02             	add    %ecx,2
    1055:	43                   	inc    %ebx
    1056:	83 c7 04             	add    %edi,4
    1059:	89 bd f0 fe ff ff    	mov    DWORD PTR [%ebp-272],%edi
    105f:	83 c6 04             	add    %esi,4
    1062:	83 45 a4 04          	add    DWORD PTR [%ebp-92],4
    1066:	3b 5d 0c             	cmp    %ebx,DWORD PTR [%ebp+12]
    1069:	7c 9d                	jl     1008

And here's the code from the separate routine, which I expected to be
faster:

    2392:	8b 75 e8             	mov    %esi,DWORD PTR [%ebp-24]
    2395:	8a 06                	mov    %al,BYTE PTR [%esi]
    2397:	24 f0                	and    %al,0xf0
    2399:	89 c2                	mov    %edx,%eax
    239b:	66 c1 e2 08          	shl    %dx,0x8
    239f:	8b 4d fc             	mov    %ecx,DWORD PTR [%ebp-4]
    23a2:	8b 75 ec             	mov    %esi,DWORD PTR [%ebp-20]
    23a5:	8a 04 0e             	mov    %al,BYTE PTR [%esi+%ecx]
    23a8:	24 f0                	and    %al,0xf0
    23aa:	25 ff 00 00 00       	and    %eax,0xff
    23af:	66 c1 e0 04          	shl    %ax,0x4
    23b3:	09 c2                	or     %edx,%eax
    23b5:	8b 4d e0             	mov    %ecx,DWORD PTR [%ebp-32]
    23b8:	8a 01                	mov    %al,BYTE PTR [%ecx]
    23ba:	24 f0                	and    %al,0xf0
    23bc:	25 ff 00 00 00       	and    %eax,0xff
    23c1:	09 c2                	or     %edx,%eax
    23c3:	8a 07                	mov    %al,BYTE PTR [%edi]
    23c5:	c0 e8 04             	shr    %al,0x4
    23c8:	25 ff 00 00 00       	and    %eax,0xff
    23cd:	09 c2                	or     %edx,%eax
    23cf:	8b 75 e4             	mov    %esi,DWORD PTR [%ebp-28]
    23d2:	66 89 16             	mov    DWORD PTR [%esi],%dx
    23d5:	83 c6 02             	add    %esi,2
    23d8:	89 75 e4             	mov    DWORD PTR [%ebp-28],%esi
    23db:	43                   	inc    %ebx
    23dc:	83 45 e8 04          	add    DWORD PTR [%ebp-24],4
    23e0:	83 c7 04             	add    %edi,4
    23e3:	83 c1 04             	add    %ecx,4
    23e6:	89 4d e0             	mov    DWORD PTR [%ebp-32],%ecx
    23e9:	83 45 ec 04          	add    DWORD PTR [%ebp-20],4
    23ed:	3b 5d 0c             	cmp    %ebx,DWORD PTR [%ebp+12]
    23f0:	7c a0                	jl     2392

The movzx instructions in the first listing are very fast (they've been
specially optimized for doing mixed 8/16 and 32 bit operations on
PPro/PII/PIII processors), while the second listing has lots of partial
register stalls (a write to %al followed by a read of %eax, for
example), which seem to be killing the performance.
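
One way to nudge a compiler towards zero-extending loads is to widen
each byte into a full-width local before combining, as in the sketch
below.  This is only an illustration: the function name is hypothetical,
the R, G, B, A byte order is an assumption, and whether a given compiler
and set of flags actually emits movzx for it has to be checked in the
generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical routine: RGBA8888 -> ARGB4444, one full-width local per
 * channel so every later operation works on a whole register. */
void convert_argb4444_wide(const unsigned char *src,
                           unsigned short *dst, size_t count)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; i < count; i++) {
        /* Each load is zero-extended to unsigned int up front. */
        unsigned int r = src[4 * i + 0];
        unsigned int g = src[4 * i + 1];
        unsigned int b = src[4 * i + 2];
        unsigned int a = src[4 * i + 3];

        /* Keep the top 4 bits of each channel, packed as A:R:G:B. */
        dst[i] = (unsigned short)(((a & 0xf0u) << 8) |
                                  ((r & 0xf0u) << 4) |
                                   (g & 0xf0u) |
                                   (b >> 4));
    }
}
```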

I'm stunned and amazed.

-- Gareth