Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

#200 Invalid vectorization (SSE)

closed-invalid
nobody
5
2010-11-13
2010-10-29
Sergey
No

Package used: mingw-w32-bin_i686-mingw_20101003_sezero.zip
Command line options: -O3 -march=barcelona
Sample code:
void epic_fail(unsigned char *a)
{
for (int i=0; i<256; i++) a[i] = i;
}

Bug description:
1. With -O3 option GCC unrolls loop with use of XMM registers. "movaps" instruction is used to store 16 byte blocks. This instruction require 16 byte aligment of data, but alignment of pointers can not be ensured. As a result the program always crash.
2. Vectorized code itself is extermely ineffective. It consists of 16 blocks included below.

.text:00401A40 movdqa xmm3, xmm4
.text:00401A44 punpcklwd xmm4, xmm5
.text:00401A48 paddd xmm3, xmm1
.text:00401A4C movdqa xmm5, xmm4
.text:00401A50 punpcklwd xmm4, xmm6
.text:00401A54 punpckhwd xmm5, xmm6
.text:00401A58 movdqa xmm6, xmm2
.text:00401A5C paddd xmm6, xmm0
.text:00401A60 punpcklwd xmm4, xmm5
.text:00401A64 movdqa xmm5, xmm2
.text:00401A68 punpcklwd xmm2, xmm6
.text:00401A6C punpckhwd xmm5, xmm6
.text:00401A70 movdqa xmm6, xmm2
.text:00401A74 punpcklwd xmm2, xmm5
.text:00401A78 punpckhwd xmm6, xmm5
.text:00401A7C movdqa xmm5, xmm4
.text:00401A80 punpcklwd xmm2, xmm6
.text:00401A84 movdqa xmm6, xmm3
.text:00401A88 punpckhbw xmm5, xmm2
.text:00401A8C punpcklbw xmm4, xmm2
.text:00401A90 movdqa xmm2, xmm4
.text:00401A94 punpcklbw xmm4, xmm5
.text:00401A98 punpckhbw xmm2, xmm5
.text:00401A9C movdqa xmm5, xmm4
.text:00401AA0 punpcklbw xmm4, xmm2
.text:00401AA4 punpckhbw xmm5, xmm2
.text:00401AA8 punpcklbw xmm4, xmm5
.text:00401AAC movdqa xmm5, xmm3
.text:00401AB0 paddd xmm5, xmm0
.text:00401AB4 punpckhwd xmm6, xmm5
.text:00401AB8 movdqa xmm2, xmm5
.text:00401ABC paddd xmm2, xmm0
.text:00401AC0 movaps xmmword ptr [esp+20h], xmm4

Discussion

  • Ozkan Sezer
    Ozkan Sezer
    2010-10-29

    Is the issue reproducible with gcc-4.5 based builds?

     
  • Sergey
    Sergey
    2010-10-29

    Yes, I tested with TDM-GCC 4.5.1, exactly the same bug.
    Reducing optimizations to -O2 removes the bug simply because SSE instructions are not used.

     
  • Ozkan Sezer
    Ozkan Sezer
    2010-10-29

    Then unless Kai suggest otherwise, I think this must be reported at GCC bugzilla, with reference to gcc-4.5 and by adding your small testcase along with its assembler output (use an additional -S on your gcc command line). Please add the gcc bugzilla number here so that we can follow more easily.

     
  • Sergey
    Sergey
    2010-10-29

    Unfortunately I can't prepare testcase for GCC4.5.1 It's behaviour seems to be improved partially. Simple hello world apps work correctly, and I found some alignment checks in assembly code. However larger project with 2M executable still suffers from the bug.

     
  • Ozkan Sezer
    Ozkan Sezer
    2010-10-29

    Well, gcc developers will ignore a gcc-4.4 based bug report because my personal build is heavily patched and is not an unmodified FSF-gcc, therefore a simple testcase will be needed. In the meantime, pinging Kai for better help.

     
  • Kai Tietz
    Kai Tietz
    2010-10-29

    well, the feature of aligned comdat (which seems to be here the issue, as data alignment is supported otherwise already even for 4.4) is present until 4.5.x. This should be the way to solve your issue. As 32-bit has in general just a 4-byte alignment on stack, you could try to enforce for your app a 16-byte alignment.
    I know that we are not doing in our startup-code for 32-bit case enforce stack-alignment to 16-bytes. Maybe this could help to solve your issues here to. For testing this, just make sure that in your main-function the following inline-assembly is present "and $0xfffffff0, %esp'.
    If this helps, we will add such an alignment to our startup.

    Regards,
    Kai

     
  • Kai Tietz
    Kai Tietz
    2010-10-30

    Thanks for telling me, but this bug report is better placed in gcc's bugzilla. As here loop unrolling produces indeed pretty weak code (and yes, pointer alignment can be here an issue).
    So I would kindly ask you to report this issue on gcc's bug-tracker. I will close this bug as invalid (not that this bug is invalid in terms of report, but not a bug of our runtime), but keep it pending, so additional comments are possible.

    Kai

     
  • Kai Tietz
    Kai Tietz
    2010-10-30

    • status: open --> pending-invalid
     
  • Try compiling with "-mstackrealign" - this can have a minor impact on performance but makes GCC properly align data on stack which is processed by SSE instructions. Alternatively you can try the patch included in my MinGW GCC packages - have a look at gcc-fix-sse-stackalignment.dpatch inside https://launchpad.net/~tobydox/+archive/mingw/+files/mingw-x-gcc_4.5.2~snapshot-12.diff.gz - without this, lots of my code has been miscompiled when enabling SSE instructions.

     
  • Sergey
    Sergey
    2010-11-02

    Unfortunately I was too busy that days, I didn't posted the bug into the main GCC bug thread.
    Now I have some additional details. GCC is aware of 16 byte alignment needed for SSE. It uses 3 different strategies to mantain alignment:
    1. For local and global arrays that is to be accessed with SSE, 16 byte alignment is made.
    2. For variables with known unalignment shift, specific changes are made into the code.
    3. For variables with unkown alignment GCC generate a bit mre complicated code with runtime alignment vaildation
    4. Program fails when variable is placed on stack, but stack alignment is destroyed by a system call.
    The most common case is a code inside DialogProc() wich invoked indirectly via DialogBox() function. It's obvious that DialogBox() unaware of 16 byte things.

     
  • Sergey
    Sergey
    2010-11-02

    -mstackrealign helps
    I will hope vectorization efficiency will be improved in the near future

     
  • Sergey
    Sergey
    2010-11-02

    -mstackrealign helps
    I will hope vectorization efficiency will be improved in the near future

     
  • Sergey
    Sergey
    2010-11-02

    -mstackrealign helps
    I will hope vectorization efficiency will be improved in the near future

     
  • This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 14 days (the time period specified by
    the administrator of this Tracker).

     
    • status: pending-invalid --> closed-invalid