From: Gareth H. <ga...@va...> - 2000-10-16 07:46:28
I've got a bunch of nice MMX code that I wanted to plug into the texutil framework, so all the hardware drivers could benefit from fast texture image conversions. I set about creating a function table that had specialized versions of the basic texture image conversion code, and at the basic level this table would be filled with functions made from splitting up _mesa_convert_teximage() into a whole bunch of smaller routines. This all applies to the subimage conversion code, but we'll use the main one as an example.

This seemed like a great idea - do some basic tests, build up a function table index, and then call the specialized conversion routine using this index. We could then make sure the Q3/UT/<insert favourite app> texture conversions were super-fast.

Anyway, I ripped the _mesa_convert_teximage() code apart and built up a whole bunch of separate functions, plugging these into the function table. I then ran a basic benchmark - converting a 32bpp RGBA texture to ARGB4444 format, used often in 16bpp rendering. I used the original _mesa_convert_teximage() code, as well as the mgaConvertTexture() as I'd noticed parts of this code were significantly faster than the original Mesa code. For this particular test, the MGA and Mesa code were about even, with the MGA ahead by perhaps a percent or two.

To my great surprise, the ripped-apart function was 25% slower. I removed the function table and called the _mesa_convert_teximage_argb_4444 function directly, but it was still 25% slower. I copied the *exact* code that was being executed from the original Mesa code and plugged it into the separate function again. Still no good. I removed the _mesa_image_address and _mesa_image_row_stride calls (which is kinda the whole point of breaking it apart like this) and made the routine inline, and the best I could get it was around 10-15% slower.

At this point, I was confused. Time to look at the compiler output.
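For readers who want to picture the function-table approach described above, here is a minimal sketch. Every name in it (convert_func, convert_table, convert_teximage, the dst_format enum) is hypothetical and chosen for illustration; none of this is the actual Mesa texutil code, and the RGB565 entry is only there to give the table a second slot. It assumes tightly packed source pixels stored as R, G, B, A bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature for one specialized conversion routine. */
typedef void (*convert_func)(const uint8_t *src, uint16_t *dst, size_t pixels);

/* RGBA8888 (bytes R,G,B,A in memory) -> ARGB4444: keep the top 4 bits
 * of each channel. */
static void convert_rgba8888_to_argb4444(const uint8_t *src, uint16_t *dst,
                                         size_t pixels)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixels; i++, src += 4) {
        dst[i] = (uint16_t)(((src[3] & 0xf0) << 8) |  /* A -> bits 12-15 */
                            ((src[0] & 0xf0) << 4) |  /* R -> bits  8-11 */
                            (src[1] & 0xf0) |         /* G -> bits  4-7  */
                            (src[2] >> 4));           /* B -> bits  0-3  */
    }
}

/* RGBA8888 -> RGB565: alpha is dropped. */
static void convert_rgba8888_to_rgb565(const uint8_t *src, uint16_t *dst,
                                       size_t pixels)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixels; i++, src += 4) {
        dst[i] = (uint16_t)(((src[0] & 0xf8) << 8) |  /* R -> bits 11-15 */
                            ((src[1] & 0xfc) << 3) |  /* G -> bits  5-10 */
                            (src[2] >> 3));           /* B -> bits  0-4  */
    }
}

/* Hypothetical table index: one entry per destination format. */
enum dst_format { DST_ARGB4444, DST_RGB565, DST_FORMAT_COUNT };

static const convert_func convert_table[DST_FORMAT_COUNT] = {
    convert_rgba8888_to_argb4444,
    convert_rgba8888_to_rgb565,
};

/* Do the basic test once, then jump straight to the specialized loop.
 * Returns 1 on success, 0 if the format has no table entry. */
int convert_teximage(enum dst_format fmt, const uint8_t *src,
                     uint16_t *dst, size_t pixels)
{
    if ((unsigned)fmt >= DST_FORMAT_COUNT)
        return 0;
    convert_table[fmt](src, dst, pixels);
    return 1;
}
```

The design point is that the format check happens once per image rather than once per pixel, so each table entry can be a tight loop that a hand-written MMX routine could later replace.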
Sure enough, the exact same block of C code was producing vastly different assembly code. Here's the code from the original Mesa routine, which I expected to be slower:

    1008:  8b 55 b4              mov    %edx,DWORD PTR [%ebp-76]
    100b:  8b 7d a4              mov    %edi,DWORD PTR [%ebp-92]
    100e:  8a 04 17              mov    %al,BYTE PTR [%edi+%edx]
    1011:  24 f8                 and    %al,0xf8
    1013:  88 85 f8 fe ff ff     mov    BYTE PTR [%ebp-264],%al
    1019:  66 c1 e0 08           shl    %ax,0x8
    101d:  66 89 85 f4 fe ff ff  mov    DWORD PTR [%ebp-268],%ax
    1024:  8a 16                 mov    %dl,BYTE PTR [%esi]
    1026:  80 e2 fc              and    %dl,0xfc
    1029:  66 0f b6 c2           movzx  %ax,%dl
    102d:  66 c1 e0 03           shl    %ax,0x3
    1031:  66 09 85 f4 fe ff ff  or     DWORD PTR [%ebp-268],%ax
    1038:  8b bd f0 fe ff ff     mov    %edi,DWORD PTR [%ebp-272]
    103e:  8a 17                 mov    %dl,BYTE PTR [%edi]
    1040:  c0 ea 03              shr    %dl,0x3
    1043:  66 0f b6 c2           movzx  %ax,%dl
    1047:  8b 95 f4 fe ff ff     mov    %edx,DWORD PTR [%ebp-268]
    104d:  09 c2                 or     %edx,%eax
    104f:  66 89 11              mov    DWORD PTR [%ecx],%dx
    1052:  83 c1 02              add    %ecx,2
    1055:  43                    inc    %ebx
    1056:  83 c7 04              add    %edi,4
    1059:  89 bd f0 fe ff ff     mov    DWORD PTR [%ebp-272],%edi
    105f:  83 c6 04              add    %esi,4
    1062:  83 45 a4 04           add    DWORD PTR [%ebp-92],4
    1066:  3b 5d 0c              cmp    %ebx,DWORD PTR [%ebp+12]
    1069:  7c 9d                 jl     1008

And here's the code from the separate routine, which I expected to be faster:

    2392:  8b 75 e8              mov    %esi,DWORD PTR [%ebp-24]
    2395:  8a 06                 mov    %al,BYTE PTR [%esi]
    2397:  24 f0                 and    %al,0xf0
    2399:  89 c2                 mov    %edx,%eax
    239b:  66 c1 e2 08           shl    %dx,0x8
    239f:  8b 4d fc              mov    %ecx,DWORD PTR [%ebp-4]
    23a2:  8b 75 ec              mov    %esi,DWORD PTR [%ebp-20]
    23a5:  8a 04 0e              mov    %al,BYTE PTR [%esi+%ecx]
    23a8:  24 f0                 and    %al,0xf0
    23aa:  25 ff 00 00 00        and    %eax,0xff
    23af:  66 c1 e0 04           shl    %ax,0x4
    23b3:  09 c2                 or     %edx,%eax
    23b5:  8b 4d e0              mov    %ecx,DWORD PTR [%ebp-32]
    23b8:  8a 01                 mov    %al,BYTE PTR [%ecx]
    23ba:  24 f0                 and    %al,0xf0
    23bc:  25 ff 00 00 00        and    %eax,0xff
    23c1:  09 c2                 or     %edx,%eax
    23c3:  8a 07                 mov    %al,BYTE PTR [%edi]
    23c5:  c0 e8 04              shr    %al,0x4
    23c8:  25 ff 00 00 00        and    %eax,0xff
    23cd:  09 c2                 or     %edx,%eax
    23cf:  8b 75 e4              mov    %esi,DWORD PTR [%ebp-28]
    23d2:  66 89 16              mov    DWORD PTR [%esi],%dx
    23d5:  83 c6 02              add    %esi,2
    23d8:  89 75 e4              mov    DWORD PTR [%ebp-28],%esi
    23db:  43                    inc    %ebx
    23dc:  83 45 e8 04           add    DWORD PTR [%ebp-24],4
    23e0:  83 c7 04              add    %edi,4
    23e3:  83 c1 04              add    %ecx,4
    23e6:  89 4d e0              mov    DWORD PTR [%ebp-32],%ecx
    23e9:  83 45 ec 04           add    DWORD PTR [%ebp-20],4
    23ed:  3b 5d 0c              cmp    %ebx,DWORD PTR [%ebp+12]
    23f0:  7c a0                 jl     2392

The movzx instructions in the first listing are very fast (they've been specially optimized for doing mixed 8/16 and 32 bit operations on PPro/PII/PIII processors), while the second listing has lots of partial register stalls (for example, the write to %al at 23a8 followed immediately by a read of all of %eax at 23aa), which seem to be killing the performance.

I'm stunned and amazed.

-- Gareth
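As a footnote to the partial-register observation, one common way to nudge a compiler away from byte-register arithmetic is to load each pixel as a single 32-bit word and do all masking and shifting at 32-bit width. The sketch below is illustrative only: the function names are hypothetical, it assumes each pixel arrives as the word 0xAABBGGRR (which is what R, G, B, A bytes look like when loaded on little-endian x86), and whether any particular compiler emits better code for it has to be checked in the disassembly, as above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-wise reference: four separate 8-bit channel values. */
uint16_t pack_argb4444_bytes(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint16_t)(((a & 0xf0) << 8) | ((r & 0xf0) << 4) |
                      (g & 0xf0) | (b >> 4));
}

/* Word-wise variant: p is assumed to be 0xAABBGGRR. Every operation
 * works on the full 32-bit value, so no 8-bit subregister needs to be
 * written and then read back at a wider size. */
uint16_t pack_argb4444_word(uint32_t p)
{
    return (uint16_t)(((p >> 16) & 0xf000u) |  /* A high nibble -> bits 12-15 */
                      ((p << 4)  & 0x0f00u) |  /* R high nibble -> bits  8-11 */
                      ((p >> 8)  & 0x00f0u) |  /* G high nibble -> bits  4-7  */
                      ((p >> 20) & 0x000fu));  /* B high nibble -> bits  0-3  */
}

/* Convert a row of pixels using the word-wise packer. */
void convert_row_word(const uint32_t *src, uint16_t *dst, size_t pixels)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixels; i++)
        dst[i] = pack_argb4444_word(src[i]);
}
```

Both packers produce identical results for the same pixel, so the byte-wise version can serve as a reference when benchmarking the word-wise one.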