From: Matthias S. <zz...@ge...> - 2013-10-29 21:19:12
Attachments:
pth_stackalign_movdqa.c
Hi there!

My application crashes when it executes a movdqa (SSE2) instruction under helgrind or drd.
See also: https://bugs.kde.org/show_bug.cgi?id=324050

This only happens when the code runs in an additional thread (not in the main thread). It also happens only on x86, not in an amd64 environment.

I now have a small testcase that demonstrates the crash (it reuses parts of none/tests/x86/insn_sse2.c). valgrind-3.9.0.TEST1 is also affected.

Steps to reproduce:

# gcc -m32 -g -msse -o pth_stackalign_movdqa pth_stackalign_movdqa.c -lpthread
# valgrind --tool=helgrind ./pth_stackalign_movdqa
==6888== Helgrind, a thread error detector
==6888== Copyright (C) 2007-2013, and GNU GPL'd, by OpenWorks LLP et al.
==6888== Using Valgrind-3.9.0.SVN and LibVEX; rerun with -h for copyright info
==6888== Command: ./pth_stackalign_movdqa
==6888==
==6888==
==6888== Process terminating with default action of signal 11 (SIGSEGV)
==6888==  General Protection Fault
==6888==    at 0x8048712: movdqa_2 (pth_stackalign_movdqa.c:43)
==6888==    by 0x80487F4: ThreadFunction (pth_stackalign_movdqa.c:80)
==6888==    by 0x402DC55: mythread_wrapper (hg_intercepts.c:233)
==6888==    by 0x4084F53: start_thread (in /lib32/libpthread-2.17.so)
==6888==    by 0x418BBBD: clone (in /lib32/libc-2.17.so)
==6888==
==6888== For counts of detected and suppressed errors, rerun with: -v
==6888== Use --history-level=approx or =none to gain increased speed, at
==6888== the cost of reduced accuracy of conflicting-access information
==6888== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

The result is the same on Ubuntu and Gentoo.

The reason is: For amd64, configure tests whether gcc supports "-mpreferred-stack-boundary=2", but gcc fails with "error: -mpreferred-stack-boundary=2 is not between 4 and 12", and so valgrind does not use this flag.

But for 32 bits this flag is used, and the stack is only kept aligned to 2^2 = 4 bytes.
SSE2 code relies on the stack keeping its 16-byte alignment, so the helgrind pth_create code leaves the stack misaligned with respect to the 16-byte boundary.

My suggestion is to remove this flag completely and let the compiler use its sane default values for stack alignment.

Regards
Matthias
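To make the alignment arithmetic in the message above concrete, here is a minimal C sketch. It is not the attached testcase; the function names are hypothetical and chosen only for illustration. It computes the boundary implied by -mpreferred-stack-boundary=N and counts how many 4-byte-aligned positions inside a 16-byte window would be acceptable as a movdqa memory operand.

```c
#include <stddef.h>
#include <stdint.h>

/* -mpreferred-stack-boundary=N asks gcc to keep the stack aligned
 * to 2^N bytes, so N=2 means 4 bytes and N=4 means 16 bytes. */
size_t stack_boundary_bytes(unsigned log2_boundary)
{
    return (size_t)1 << log2_boundary;
}

/* movdqa raises a general protection fault unless its memory
 * operand is 16-byte aligned. */
int ok_for_movdqa(const void *addr)
{
    return ((uintptr_t)addr % 16u) == 0;
}

/* Hypothetical helper: walk a 16-byte window in steps of the 4-byte
 * boundary and count the positions that movdqa could safely use. */
unsigned count_movdqa_safe_slots(void)
{
    static _Alignas(16) unsigned char base[32];
    unsigned safe = 0;

    for (size_t off = 0; off < 16; off += stack_boundary_bytes(2))
        safe += (unsigned)ok_for_movdqa(base + off);
    return safe;
}
```

Only one of the four positions is 16-byte aligned, which is why a stack that is merely kept 4-byte aligned can work by luck in one thread and fault in another.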
From: Philippe W. <phi...@sk...> - 2013-10-30 00:18:49
On Tue, 2013-10-29 at 22:18 +0100, Matthias Schwarzott wrote:
> The reason is: For amd64, configure tests if gcc supports
> "-mpreferred-stack-boundary=2", but the gcc fails with "error:
> -mpreferred-stack-boundary=2 is not between 4 and 12", and so valgrind
> does not use this flag.
>
> But for 32bits this flag is used, and the stack is the only kept aligned
> to 2^2 = 4 bytes.
> SSE2 requires to keep the 16 bytes alignment, so the helgrind pth_create
> code leaves the stack 16-bytes unaligned.
>
> My suggestion is to completely remove this flag and let the compiler use
> the sane default values for stack alignment.

It looks like removing this flag significantly degrades the performance
of some perf tests on x86 (sometimes by 20% or so).
The (hacky) patch below seems to fix the problem without degrading
the performance.
I tried various alternative approaches, e.g.
using -mstackrealign instead of resetting the boundary to 4,
or using the force_align_arg_pointer attribute on mythread_wrapper
in hg_intercepts.c (I believe this is the culprit wrapper which
"unaligns" the stack).
But the only thing that worked was the =4 setting below.
Note that the last line of the patch is inspired by
memcheck/Makefile.am but seems to have no effect on the compilation.
I suppose this is a dead line in memcheck/Makefile.am.

Philippe

Index: helgrind/Makefile.am
===================================================================
--- helgrind/Makefile.am	(revision 13706)
+++ helgrind/Makefile.am	(working copy)
@@ -95,7 +95,7 @@
 vgpreload_helgrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_so_CPPFLAGS = \
 	$(AM_CPPFLAGS_@VGCONF_PLATFORM_PRI_CAPS@)
 vgpreload_helgrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_so_CFLAGS = \
-	$(AM_CFLAGS_@VGCONF_PLATFORM_PRI_CAPS@) $(AM_CFLAGS_PIC)
+	$(AM_CFLAGS_@VGCONF_PLATFORM_PRI_CAPS@) $(AM_CFLAGS_PIC) -mpreferred-stack-boundary=4
 vgpreload_helgrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_so_DEPENDENCIES = \
 	$(LIBREPLACEMALLOC_@VGCONF_PLATFORM_PRI_CAPS@)
 vgpreload_helgrind_@VGCONF_ARCH_PRI@_@VGCONF_OS@_so_LDFLAGS = \
@@ -116,3 +116,4 @@
 	$(LIBREPLACEMALLOC_LDFLAGS_@VGCONF_PLATFORM_SEC_CAPS@)
 endif
 
+hg_intercepts.o: CFLAGS += -mstackrealign
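For readers unfamiliar with the force_align_arg_pointer attribute mentioned in the message above, the following C sketch shows roughly what applying it to a thread entry function looks like. It is not the code from hg_intercepts.c: the function name, the macro and the out-parameter are hypothetical, and Philippe reports that this approach did not fix the crash in his tests, so treat it only as an illustration of the syntax.

```c
#include <stddef.h>
#include <stdint.h>

/* The attribute is x86-specific, so it is guarded here (an assumption
 * made to keep the sketch portable to other targets). */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define REALIGN_ON_ENTRY __attribute__((force_align_arg_pointer))
#else
#define REALIGN_ON_ENTRY
#endif

/* Hypothetical thread entry point. The attribute asks gcc to realign
 * the stack in the prologue instead of trusting the caller. */
REALIGN_ON_ENTRY
void *thread_entry_sketch(void *arg)
{
    /* A local that needs 16-byte alignment, as an SSE2 spill slot would. */
    _Alignas(16) unsigned char scratch[16];
    volatile unsigned char *touch = scratch;

    touch[0] = 0; /* keep the buffer from being optimised away */

    /* Report how far the buffer is from a 16-byte boundary (0 = aligned). */
    if (arg != NULL)
        *(uintptr_t *)arg = (uintptr_t)scratch % 16u;
    return NULL;
}
```

The function has the signature expected by pthread_create, so it could be passed as a start routine with a pointer to a uintptr_t as its argument.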
From: Matthias S. <zz...@ge...> - 2013-10-30 20:23:41
Attachments:
valgrind-realign-stack-preload-libs.patch
On 30.10.2013 02:19, Philippe Waroquiers wrote:
> On Tue, 2013-10-29 at 22:18 +0100, Matthias Schwarzott wrote:
>
>> The reason is: For amd64, configure tests if gcc supports
>> "-mpreferred-stack-boundary=2", but the gcc fails with "error:
>> -mpreferred-stack-boundary=2 is not between 4 and 12", and so valgrind
>> does not use this flag.
>>
>> But for 32bits this flag is used, and the stack is the only kept aligned
>> to 2^2 = 4 bytes.
>> SSE2 requires to keep the 16 bytes alignment, so the helgrind pth_create
>> code leaves the stack 16-bytes unaligned.
>>
>> My suggestion is to completely remove this flag and let the compiler use
>> the sane default values for stack alignment.
> It looks like removing this flag degrades significantly the performances
> of some perf tests on x86 (sometimes by 20% or so).
> The below (hacky) patch seems to fix the problem, without degrading
> the performance.
> I tried various alternatives approaches e.g
> use -mstackrealign instead of reseting the boundary to 4
> or use an attribute force_align_arg_pointer on hg_intercepts.c
> mythread_wrapper (I believe this is the culprit wrapper which "unaligns"
> the stack).
> But the only working thing I obtained was with the below =4.
> Note that the last line of the patch is inspired from
> memcheck/Makefile.am but seems to have no effect on the compilation.
> I suppose this is a dead line in memcheck/Makefile.am

This patch solves the problem on the x86 installation.

The basic idea is to add more flags to the preloaded lib object files to force stack alignment there.
I used "-mincoming-stack-boundary=2 -mpreferred-stack-boundary=4" so that the code works correctly with a caller that only has 4-byte alignment.
If all callers of this code are already correctly aligned, it might even be enough to override the wrong setting with "-mpreferred-stack-boundary=4".

I noticed that the general flag handling deserves some cleanup.
In my case of an amd64+x86 install, the flag is never set, because gcc assumes a 64-bit compile when given no options, and 4-byte alignment is not valid for that target.
It would make sense to check the flag separately for the primary and the secondary target; otherwise the secondary target ends up with the default 16-byte alignment instead.

Regards
Matthias
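As a rough illustration of what the flag combination "-mincoming-stack-boundary=2 -mpreferred-stack-boundary=4" asks for, the C sketch below models the arithmetic of realigning a stack pointer: an incoming address that is only 4-byte aligned is rounded down to the next 16-byte boundary. This is a conceptual model with hypothetical function names, not the code gcc actually emits.

```c
#include <stdint.h>

/* Round a (downward-growing) stack pointer down to a 2^log2_boundary
 * byte boundary, as a realigning function prologue conceptually does. */
uintptr_t realign_down(uintptr_t sp, unsigned log2_boundary)
{
    uintptr_t mask = ((uintptr_t)1 << log2_boundary) - 1;
    return sp & ~mask;
}

/* Number of bytes skipped by the realignment. For a 4-byte aligned
 * incoming pointer and a 16-byte target this is 0, 4, 8 or 12. */
uintptr_t realign_cost(uintptr_t sp, unsigned log2_boundary)
{
    return sp - realign_down(sp, log2_boundary);
}
```

For example, an incoming pointer of 0x100c realigned to a 16-byte boundary becomes 0x1000, skipping 12 bytes, while 0x1010 is already aligned and is left unchanged.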