From: SourceForge.net <no...@so...> - 2009-12-01 14:09:03
Bugs item #2906836, was opened at 2009-12-01 15:09
Message generated for change (Tracker Item Submitted) made by thomas-denk
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=102435&aid=2906836&group_id=2435

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: gcc
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: thomas (thomas-denk)
Assigned to: Nobody/Anonymous (nobody)
Summary: __restrict__ not honoured or not working properly

Initial Comment:
Applies to:
-----------
Target: mingw32
Configured with: ../gcc-4.4.0/configure --prefix=/mingw --build=mingw32 --enable-languages=c,ada,c++,fortran,objc,obj-c++ --disable-nls --disable-win32-registry --disable-werror --enable-threads --disable-symvers --enable-cxx-flags='-fno-function-sections -fno-data-sections' --enable-fully-dynamic-string --enable-libgomp --enable-version-specific-runtime-libs --disable-sjlj-exceptions --program-suffix=-dw2 --with-pkgversion='TDM-1 mingw32' --with-bugurl=http://www.tdragon.net/recentgcc/bugs.php
Thread model: win32
gcc version 4.4.0-dw2 (TDM-1 mingw32)

Objective:
----------
Provide a template function to avoid doing manual unrolling and copying identical code for half a dozen logical operators. The aim is to read in data, overlap operations as loads are satisfied, and write out results. Simple enough.

Implementation:
---------------
template<typename T>
void op(T f, __m128i* __restrict__ dst, __m128i* __restrict__ src, unsigned int n)
{
    while(n >= 4)
    {
        n -= 4;
        dst[0] = f(src[0], dst[0]);
        dst[1] = f(src[1], dst[1]);
        dst[2] = f(src[2], dst[2]);
        dst[3] = f(src[3], dst[3]);
        dst += 4;
        src += 4;
    }
    while(n--)
    {
        *dst = f(*src, *dst);
        ++src;
        ++dst;
    }
}

Usage example: op(_mm_and_si128, buf1, buf2, size);

Problem:
--------
The compiler seems to assume aliasing and therefore produces non-pipelined code, even though it has been told that there is no aliasing:

    pand    (%ebx,%eax), %xmm0
    movdqa  %xmm0, (%ebx,%eax)
    movdqa  16(%edx,%eax), %xmm0
    pand    16(%ebx,%eax), %xmm0
    movdqa  %xmm0, 16(%ebx,%eax)
    movdqa  32(%edx,%eax), %xmm0
    pand    32(%ebx,%eax), %xmm0
    movdqa  %xmm0, 32(%ebx,%eax)
    movdqa  48(%edx,%eax), %xmm0
    pand    48(%ebx,%eax), %xmm0
    movdqa  %xmm0, 48(%ebx,%eax)
    addl    $64, %eax

Rewriting the function so all inputs are manually consumed before outputs are written (the old, "before restrict age" way) like this:

template<typename T>
void op(T f, __m128i* __restrict__ dst, __m128i* __restrict__ src, unsigned int n)
{
    while(n >= 4)
    {
        n -= 4;
        __m128i t1 = f(src[0], dst[0]);
        __m128i t2 = f(src[1], dst[1]);
        __m128i t3 = f(src[2], dst[2]);
        __m128i t4 = f(src[3], dst[3]);
        dst[0] = t1;
        dst[1] = t2;
        dst[2] = t3;
        dst[3] = t4;
        dst += 4;
        src += 4;
    }
    while(n--)
    {
        *dst = f(*src, *dst);
        ++src;
        ++dst;
    }
}

proves that the compiler is indeed able to generate the desired pipelined code:

    movdqa  (%edx,%eax), %xmm3
    movdqa  16(%edx,%eax), %xmm2
    movdqa  32(%edx,%eax), %xmm1
    movdqa  48(%edx,%eax), %xmm0
    pand    (%ebx,%eax), %xmm3
    pand    16(%ebx,%eax), %xmm2
    pand    32(%ebx,%eax), %xmm1
    pand    48(%ebx,%eax), %xmm0
    subl    $4, %edi
    movdqa  %xmm3, (%ebx,%eax)
    movdqa  %xmm2, 16(%ebx,%eax)
    movdqa  %xmm1, 32(%ebx,%eax)
    movdqa  %xmm0, 48(%ebx,%eax)
    addl    $64, %eax
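
For anyone who wants to compare the two loop shapes from the report without SSE2 hardware, the following is a minimal portable sketch. Vec128, and128, op_interleaved, op_loads_first and variants_agree are hypothetical names that do not appear in the report: Vec128 stands in for __m128i and and128 for _mm_and_si128. It assumes a compiler that accepts the __restrict__ spelling (GCC or Clang). It only checks that both variants compute the same results on non-overlapping buffers; the instruction scheduling still has to be inspected in the generated assembly (for example with -S).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical portable stand-in for __m128i (four 32-bit lanes).
struct Vec128 {
    std::uint32_t lane[4];
};

// Hypothetical stand-in for _mm_and_si128: lane-wise bitwise AND.
inline Vec128 and128(Vec128 a, Vec128 b) {
    Vec128 r;
    for (unsigned int k = 0; k < 4; ++k) r.lane[k] = a.lane[k] & b.lane[k];
    return r;
}

// Loop shape 1 from the report: every store directly follows its loads.
template <typename F>
void op_interleaved(F f, Vec128* __restrict__ dst, Vec128* __restrict__ src,
                    unsigned int n) {
    while (n >= 4) {
        n -= 4;
        dst[0] = f(src[0], dst[0]);
        dst[1] = f(src[1], dst[1]);
        dst[2] = f(src[2], dst[2]);
        dst[3] = f(src[3], dst[3]);
        dst += 4;
        src += 4;
    }
    while (n--) {
        *dst = f(*src, *dst);
        ++src;
        ++dst;
    }
}

// Loop shape 2 from the report: all four results are computed into
// temporaries before any of them is stored.
template <typename F>
void op_loads_first(F f, Vec128* __restrict__ dst, Vec128* __restrict__ src,
                    unsigned int n) {
    while (n >= 4) {
        n -= 4;
        Vec128 t1 = f(src[0], dst[0]);
        Vec128 t2 = f(src[1], dst[1]);
        Vec128 t3 = f(src[2], dst[2]);
        Vec128 t4 = f(src[3], dst[3]);
        dst[0] = t1;
        dst[1] = t2;
        dst[2] = t3;
        dst[3] = t4;
        dst += 4;
        src += 4;
    }
    while (n--) {
        *dst = f(*src, *dst);
        ++src;
        ++dst;
    }
}

// Runs both variants on separate, non-overlapping buffers of n elements
// and reports whether they produced identical results.
bool variants_agree(unsigned int n) {
    std::vector<Vec128> src(n), a(n);
    for (unsigned int i = 0; i < n; ++i) {
        for (unsigned int k = 0; k < 4; ++k) {
            // Arbitrary deterministic bit patterns used as test data.
            src[i].lane[k] = 0x9E3779B9u * (4u * i + k + 1u);
            a[i].lane[k] = ~(0x85EBCA6Bu * (4u * i + k + 1u));
        }
    }
    std::vector<Vec128> b = a;  // identical starting contents for variant 2
    op_interleaved(and128, a.data(), src.data(), n);
    op_loads_first(and128, b.data(), src.data(), n);
    for (unsigned int i = 0; i < n; ++i)
        for (unsigned int k = 0; k < 4; ++k)
            if (a[i].lane[k] != b[i].lane[k]) return false;
    return true;
}
```

Calling variants_agree with sizes that are and are not multiples of 4 exercises both the unrolled loop and the remainder loop. Since the restrict contract is honoured by the caller here, any difference between the two variants would point at a miscompile rather than at the missed optimisation described above.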