From: Gareth H. <ga...@va...> - 2000-10-16 07:46:28
I've got a bunch of nice MMX code that I wanted to plug into the texutil framework, so all the hardware drivers could benefit from fast texture image conversions. I set about creating a function table that had specialized versions of the basic texture image conversion code, and at the basic level this table would be filled with functions made from splitting up _mesa_convert_teximage() into a whole bunch of smaller routines. This all applies to the subimage conversion code, but we'll use the main one as an example.

This seemed like a great idea - do some basic tests, build up a function table index, and then call the specialized conversion routine using this index. We could then make sure the Q3/UT/<insert favourite app> texture conversions were super-fast.

Anyway, I ripped the _mesa_convert_teximage() code apart and built up a whole bunch of separate functions, plugging these into the function table. I then ran a basic benchmark - converting a 32bpp RGBA texture to ARGB4444 format, used often in 16bpp rendering. I used the original _mesa_convert_teximage() code, as well as the mgaConvertTexture() as I'd noticed parts of this code were significantly faster than the original Mesa code. For this particular test, the MGA and Mesa code were about even, with the MGA ahead by perhaps a percent or two.

To my great surprise, the ripped-apart function was 25% slower. I removed the function table and called the _mesa_convert_teximage_argb_4444 function directly, but it was still 25% slower. I copied the *exact* code that was being executed from the original Mesa code and plugged it into the separate function again. Still no good. I removed the _mesa_image_address and _mesa_image_row_stride calls (which is kinda the whole point of breaking it apart like this) and made the routine inline, and the best I could get it was around 10-15% slower.

At this point, I was confused. Time to look at the compiler output.
Sure enough, the exact same block of C code was producing vastly different assembly code. Here's the code from the original Mesa routine, which should be slower:

  1008:  8b 55 b4               mov    %edx,DWORD PTR [%ebp-76]
  100b:  8b 7d a4               mov    %edi,DWORD PTR [%ebp-92]
  100e:  8a 04 17               mov    %al,BYTE PTR [%edi+%edx]
  1011:  24 f8                  and    %al,0xf8
  1013:  88 85 f8 fe ff ff      mov    BYTE PTR [%ebp-264],%al
  1019:  66 c1 e0 08            shl    %ax,0x8
  101d:  66 89 85 f4 fe ff ff   mov    DWORD PTR [%ebp-268],%ax
  1024:  8a 16                  mov    %dl,BYTE PTR [%esi]
  1026:  80 e2 fc               and    %dl,0xfc
  1029:  66 0f b6 c2            movzx  %ax,%dl
  102d:  66 c1 e0 03            shl    %ax,0x3
  1031:  66 09 85 f4 fe ff ff   or     DWORD PTR [%ebp-268],%ax
  1038:  8b bd f0 fe ff ff      mov    %edi,DWORD PTR [%ebp-272]
  103e:  8a 17                  mov    %dl,BYTE PTR [%edi]
  1040:  c0 ea 03               shr    %dl,0x3
  1043:  66 0f b6 c2            movzx  %ax,%dl
  1047:  8b 95 f4 fe ff ff      mov    %edx,DWORD PTR [%ebp-268]
  104d:  09 c2                  or     %edx,%eax
  104f:  66 89 11               mov    DWORD PTR [%ecx],%dx
  1052:  83 c1 02               add    %ecx,2
  1055:  43                     inc    %ebx
  1056:  83 c7 04               add    %edi,4
  1059:  89 bd f0 fe ff ff      mov    DWORD PTR [%ebp-272],%edi
  105f:  83 c6 04               add    %esi,4
  1062:  83 45 a4 04            add    DWORD PTR [%ebp-92],4
  1066:  3b 5d 0c               cmp    %ebx,DWORD PTR [%ebp+12]
  1069:  7c 9d                  jl     1008

And here's the code from the separate routine, which should be faster:

  2392:  8b 75 e8               mov    %esi,DWORD PTR [%ebp-24]
  2395:  8a 06                  mov    %al,BYTE PTR [%esi]
  2397:  24 f0                  and    %al,0xf0
  2399:  89 c2                  mov    %edx,%eax
  239b:  66 c1 e2 08            shl    %dx,0x8
  239f:  8b 4d fc               mov    %ecx,DWORD PTR [%ebp-4]
  23a2:  8b 75 ec               mov    %esi,DWORD PTR [%ebp-20]
  23a5:  8a 04 0e               mov    %al,BYTE PTR [%esi+%ecx]
  23a8:  24 f0                  and    %al,0xf0
  23aa:  25 ff 00 00 00         and    %eax,0xff
  23af:  66 c1 e0 04            shl    %ax,0x4
  23b3:  09 c2                  or     %edx,%eax
  23b5:  8b 4d e0               mov    %ecx,DWORD PTR [%ebp-32]
  23b8:  8a 01                  mov    %al,BYTE PTR [%ecx]
  23ba:  24 f0                  and    %al,0xf0
  23bc:  25 ff 00 00 00         and    %eax,0xff
  23c1:  09 c2                  or     %edx,%eax
  23c3:  8a 07                  mov    %al,BYTE PTR [%edi]
  23c5:  c0 e8 04               shr    %al,0x4
  23c8:  25 ff 00 00 00         and    %eax,0xff
  23cd:  09 c2                  or     %edx,%eax
  23cf:  8b 75 e4               mov    %esi,DWORD PTR [%ebp-28]
  23d2:  66 89 16               mov    DWORD PTR [%esi],%dx
  23d5:  83 c6 02               add    %esi,2
  23d8:  89 75 e4               mov    DWORD PTR [%ebp-28],%esi
  23db:  43                     inc    %ebx
  23dc:  83 45 e8 04            add    DWORD PTR [%ebp-24],4
  23e0:  83 c7 04               add    %edi,4
  23e3:  83 c1 04               add    %ecx,4
  23e6:  89 4d e0               mov    DWORD PTR [%ebp-32],%ecx
  23e9:  83 45 ec 04            add    DWORD PTR [%ebp-20],4
  23ed:  3b 5d 0c               cmp    %ebx,DWORD PTR [%ebp+12]
  23f0:  7c a0                  jl     2392

The movzx instructions are very fast (they've been specially optimized for doing mixed 8/16 and 32 bit operations on PPro/PII/PIII processors), while the second listing has lots of partial register stalls which seem to be killing the performance.

I'm stunned and amazed.

-- Gareth
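The function-table approach described in this message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the texutil code: the `dst_format` enum, the `convert_func` type, `convert_table`, and `convert_texels()` are hypothetical names, and the source is assumed to be tightly packed 8-bit RGBA with no row padding or pixel-store handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical destination formats; these names are illustrative, not Mesa's. */
enum dst_format { DST_ARGB4444 = 0, DST_RGB565 = 1, DST_FORMAT_COUNT };

/* Every specialized routine shares one signature so it can live in a table. */
typedef void (*convert_func)(uint16_t *dst, const uint8_t *src, size_t texels);

/* 8-bit RGBA -> 16-bit ARGB4444: keep the top 4 bits of each channel. */
static void convert_rgba8888_to_argb4444(uint16_t *dst, const uint8_t *src,
                                         size_t texels)
{
    assert(dst && src);
    for (size_t i = 0; i < texels; i++, src += 4) {
        unsigned r = src[0], g = src[1], b = src[2], a = src[3];
        dst[i] = (uint16_t)(((a & 0xf0) << 8) | ((r & 0xf0) << 4) |
                            (g & 0xf0) | ((b & 0xf0) >> 4));
    }
}

/* 8-bit RGBA -> 16-bit RGB565: 5 bits red, 6 bits green, 5 bits blue. */
static void convert_rgba8888_to_rgb565(uint16_t *dst, const uint8_t *src,
                                       size_t texels)
{
    assert(dst && src);
    for (size_t i = 0; i < texels; i++, src += 4) {
        unsigned r = src[0], g = src[1], b = src[2];
        dst[i] = (uint16_t)(((r & 0xf8) << 8) | ((g & 0xfc) << 3) | (b >> 3));
    }
}

/* The table is indexed by destination format. */
static const convert_func convert_table[DST_FORMAT_COUNT] = {
    convert_rgba8888_to_argb4444, /* DST_ARGB4444 */
    convert_rgba8888_to_rgb565,   /* DST_RGB565 */
};

/* Look up the specialized routine and call it.
 * Returns 1 on success, 0 if the format has no table entry. */
static int convert_texels(enum dst_format format, uint16_t *dst,
                          const uint8_t *src, size_t texels)
{
    if ((unsigned)format >= DST_FORMAT_COUNT)
        return 0;
    convert_table[format](dst, src, texels);
    return 1;
}
```

A caller would compute the index once per image and then make a single call such as `convert_texels(DST_ARGB4444, dst, src, width * height)`. Whether the split-out routines are actually faster than the monolithic one still has to be measured, which is exactly what this thread is about.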
From: Gareth H. <ga...@va...> - 2000-10-16 11:53:13
I guess the C code might help, although not much...

Original:

GLboolean
_mesa_convert_teximage(MesaIntTexFormat dstFormat,
                       GLint dstWidth, GLint dstHeight, GLvoid *dstImage,
                       GLint dstRowStride,
                       GLint srcWidth, GLint srcHeight,
                       GLenum srcFormat, GLenum srcType,
                       const GLvoid *srcImage,
                       const struct gl_pixelstore_attrib *packing)
{
   const GLint wScale = dstWidth / srcWidth;   /* must be power of two */
   const GLint hScale = dstHeight / srcHeight; /* must be power of two */
   ASSERT(dstWidth >= srcWidth);
   ASSERT(dstHeight >= srcHeight);
   ASSERT(dstImage);
   ASSERT(srcImage);
   ASSERT(packing);

   switch (dstFormat) {
      ...
      case MESA_A4_R4_G4_B4:
         /* store as 16-bit texels (GR_TEXFMT_ARGB_4444) */
         if (srcFormat == GL_BGRA && srcType == GL_UNSIGNED_SHORT_4_4_4_4_REV){
            ...
         }
         else if (srcFormat == GL_RGBA && srcType == GL_UNSIGNED_BYTE) {
            /* general case */
            if (wScale == 1 && hScale == 1) {
               const GLubyte *src = _mesa_image_address(packing, srcImage,
                                       srcWidth, srcHeight,
                                       srcFormat, srcType, 0, 0, 0);
               const GLint srcStride = _mesa_image_row_stride(packing,
                                       srcWidth, srcFormat, srcType);
               GLushort *dst = (GLushort *) dstImage;
               GLint row;
               for (row = 0; row < dstHeight; row++) {
                  GLint col, col4;
                  for (col = col4 = 0; col < dstWidth; col++, col4 += 4) {
                     GLubyte r = src[col4 + 0];
                     GLubyte g = src[col4 + 1];
                     GLubyte b = src[col4 + 2];
                     GLubyte a = src[col4 + 3];
                     dst[col] = ((a & 0xf0) << 8)
                              | ((r & 0xf0) << 4)
                              | ((g & 0xf0)     )
                              | ((b & 0xf0) >> 4);
                  }
                  src += srcStride;
                  dst = (GLushort *) ((GLubyte *) dst + dstRowStride);
               }
            }
            else {
               ...
            }
         }
         else {
            ...
         }
         break;
      ...
   }
   return GL_TRUE;
}

Special case:

GLboolean
_mesa_convert_teximage_argb_4444(MesaIntTexFormat dstFormat,
                                 GLint dstWidth, GLint dstHeight,
                                 GLvoid *dstImage, GLint dstRowStride,
                                 GLint srcWidth, GLint srcHeight,
                                 GLenum srcFormat, GLenum srcType,
                                 const GLvoid *srcImage,
                                 const struct gl_pixelstore_attrib *packing)
{
   const GLubyte *src = srcImage;
   const GLint srcStride = srcWidth * 2;
   GLushort *dst = (GLushort *) dstImage;
   GLint row;
   for (row = 0; row < dstHeight; row++) {
      GLint col, col4;
      for (col = col4 = 0; col < dstWidth; col++, col4 += 4) {
         GLubyte r = src[col4 + 0];
         GLubyte g = src[col4 + 1];
         GLubyte b = src[col4 + 2];
         GLubyte a = src[col4 + 3];
         dst[col] = ((a & 0xf0) << 8)
                  | ((r & 0xf0) << 4)
                  | ((g & 0xf0)     )
                  | ((b & 0xf0) >> 4);
      }
      src += srcStride;
      dst = (GLushort *) ((GLubyte *) dst + dstRowStride);
   }
   return GL_TRUE;
}

Figure that one out...
From: Josh V. <ho...@na...> - 2000-10-16 18:57:49
Gareth Hughes <ga...@va...> writes:

> The movzx instructions are very fast (they've been specially optimized
> for doing mixed 8/16 and 32 bit operations on PPro/PII/PIII processors),
> while the second listing has lots of partial register stalls which seem
> to be killing the performance.
>
> I'm stunned and amazed.

It looks like gcc is trying to keep the values as ubytes for too long. If you change r,g,b,a from GLubyte to GLuint, it seems to do a better job. With an old GCC 2.96 snapshot (20000529) and using uints instead of ubytes, you get this:

        .file   "mesaprs.c"
        .version        "01.01"
gcc2_compiled.:
.text
        .align 4
.globl _mesa_convert_teximage_argb_4444
        .type   _mesa_convert_teximage_argb_4444,@function
_mesa_convert_teximage_argb_4444:
        pushl   %ebp
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        subl    $12, %esp
        movl    52(%esp), %eax
        sall    $1, %eax
        movl    $0, (%esp)
        movl    40(%esp), %edx
        movl    %eax, 8(%esp)
        cmpl    %edx, (%esp)
        movl    44(%esp), %eax
        movl    68(%esp), %esi
        movl    %eax, 4(%esp)
        jae     .L13
        .p2align 2
.L6:
        xorl    %ebp, %ebp
        xorl    %edi, %edi
        cmpl    36(%esp), %ebp
        jae     .L14
        .p2align 2
.L10:
        movzbl  (%edi,%esi), %edx
        movzbl  3(%edi,%esi), %eax
        andl    $240, %eax
        andl    $240, %edx
        movzbl  1(%edi,%esi), %ecx
        sall    $4, %edx
        sall    $8, %eax
        movzbl  2(%edi,%esi), %ebx
        orl     %edx, %eax
        andl    $240, %ecx
        orl     %ecx, %eax
        shrl    $4, %ebx
        orl     %ebx, %eax
        movl    4(%esp), %edx
        movw    %ax, (%edx,%ebp,2)
        incl    %ebp
        addl    $4, %edi
        cmpl    36(%esp), %ebp
        jb      .L10
.L14:
        movl    48(%esp), %eax
        incl    (%esp)
        movl    40(%esp), %edx
        addl    8(%esp), %esi
        addl    %eax, 4(%esp)
        cmpl    %edx, (%esp)
        jb      .L6
.L13:
        addl    $12, %esp
        popl    %ebx
        popl    %esi
        movl    $1, %eax
        popl    %edi
        popl    %ebp
        ret
.Lfe1:
        .size   _mesa_convert_teximage_argb_4444,.Lfe1-_mesa_convert_teximage_argb_4444
        .ident  "GCC: (GNU) 2.96 20000529 (experimental)"
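The change suggested here, widening the per-channel temporaries from bytes to full-width unsigned integers, might look like the sketch below. It uses `uint8_t` and `unsigned` in place of `GLubyte` and `GLuint` so that it compiles without the GL headers, and the function names and the `variants_agree()` helper are hypothetical. It only demonstrates that the two variants produce identical texels; which one a given compiler turns into faster code depends on the compiler version and flags and is not measured here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-sized temporaries, as in the loop posted earlier in the thread. */
static void pack_argb4444_bytes(uint16_t *dst, const uint8_t *src,
                                size_t texels)
{
    assert(dst && src);
    for (size_t i = 0; i < texels; i++) {
        uint8_t r = src[4 * i + 0];
        uint8_t g = src[4 * i + 1];
        uint8_t b = src[4 * i + 2];
        uint8_t a = src[4 * i + 3];
        dst[i] = (uint16_t)(((a & 0xf0) << 8) | ((r & 0xf0) << 4) |
                            (g & 0xf0) | ((b & 0xf0) >> 4));
    }
}

/* Same arithmetic, but each channel is widened to unsigned int on load,
 * which gives the compiler no reason to work in 8-bit sub-registers. */
static void pack_argb4444_words(uint16_t *dst, const uint8_t *src,
                                size_t texels)
{
    assert(dst && src);
    for (size_t i = 0; i < texels; i++) {
        unsigned r = src[4 * i + 0];
        unsigned g = src[4 * i + 1];
        unsigned b = src[4 * i + 2];
        unsigned a = src[4 * i + 3];
        dst[i] = (uint16_t)(((a & 0xf0) << 8) | ((r & 0xf0) << 4) |
                            (g & 0xf0) | ((b & 0xf0) >> 4));
    }
}

/* Returns 1 when both variants agree on every texel of a small
 * deterministic test pattern, 0 otherwise. */
static int variants_agree(void)
{
    enum { TEXELS = 64 };
    uint8_t src[4 * TEXELS];
    uint16_t from_bytes[TEXELS];
    uint16_t from_words[TEXELS];

    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (uint8_t)(i * 37u + 11u);

    pack_argb4444_bytes(from_bytes, src, TEXELS);
    pack_argb4444_words(from_words, src, TEXELS);

    for (size_t i = 0; i < TEXELS; i++)
        if (from_bytes[i] != from_words[i])
            return 0;
    return 1;
}
```

Because the C integer promotions already widen a `uint8_t` operand to `int` before the mask and shift, the two functions are semantically identical; any difference shows up only in the generated code, which is best compared by compiling both with the compiler and flags in question and reading the inner loops.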
From: Gareth H. <ga...@va...> - 2000-10-16 23:18:13
Josh Vanderhoof wrote:
>
> It looks like gcc is trying to keep the values as ubytes for too long.
> If you change r,g,b,a from GLubyte to GLuint, it seems to do a better
> job.

Yes, I had noticed a significant speedup by doing this. The final version of the templated code wasn't going to be this exact code. However, I was more interested in the fact that gcc output vastly different assembly depending on where the C code was located, with such a huge difference in performance. I thought I might share this with the list.

-- Gareth