|
From: Earl C. <ear...@ya...> - 2016-07-04 08:56:55
|
I believe I have found a problem with handling of signals in a multithreaded program. I think I'm close to the root cause and would appreciate advice in where to look next. I have been seeing an intermittent error message regarding an invalid write to a stack location reported by valgrind. Near the time of failure, two threads are running (main thread, and one other), and a signal (SIGCHLD) is delivered to a signal handler which runs in the main thread. The invalid write is reported after the signal handler returns. I see this problem on valgrind 3.11 and earlier. ==25683== Invalid write of size 4 ==25683== at 0x806E19A: eventclockTime (timekeeping_.c:249) ... ==25683== Address 0xbeb6e680 is on thread 1's stack ==25683== in frame #0, created by eventclockTime (timekeeping_.c:246) This problem seems sensitive to stack layout (eg removing environment variables seems to make the problem go away), architecture (eg I see this on x86 Linux, but I have not observed this on x86_64), and timing (ie the problem does not occur on every run, and the functions in question are called frequently). Just before the failure, the main thread runs. I've added some extra instrumentation that is prefixed with "+ " showing stack activation around the failing address (which in this case is 0xbeb6e680): + mc_new_mem_stack 0xbeb6e650 0x5c 0x0 current_stack 0xBEB6E000-0xBEB6FFFF 0 new_SP 0xBEB6E62C old_SP 0xBEB6E640 current_stack 0xBEB6E000-0xBEB6FFFF 0 new_SP 0xBEB6E640 old_SP 0xBEB6E62C + mc_storev32 25683 4960 vbits32 0x0 0xbeb6e680 vabits8 0x55 + mc_storev32 25683 4973 0xbeb6e680 vabits8 0xaa current_stack 0xBEB6E000-0xBEB6FFFF 0 new_SP 0xBEB6E6AC old_SP 0xBEB6E650 + mc_die_mem_stack 0xbeb6e650 0x5c 0x0 + set_address_range_perms 1738 3514 0xbeb6e680 len 44 smoff 7376 vabits16 0 current_stack 0xBEB6E000-0xBEB6FFFF 0 new_SP 0xBEB6E670 old_SP 0xBEB6E6AC + mc_new_mem_stack 0xbeb6e670 0x3c 0x0 current_stack 0xBEB6E000-0xBEB6FFFF 0 new_SP 0xBEB6E6AC old_SP 0xBEB6E670 + mc_die_mem_stack 0xbeb6e670 0x3c 0x0 + set_address_range_perms 1738 3514 0xbeb6e680 len 44 smoff 7376 vabits16 0 current_stack 0xBEB6E000-0xBEB6FFFF 0 new_SP 0x482C494 old_SP 0x482C494 new current_stack 0x4630000-0x482EFFF 1 At this point, the other thread runs, and execution seems to switch between the two: ... current_stack 0x4630000-0x482EFFF 1 new_SP 0x482C4A4 old_SP 0x482C480 current_stack 0xBEB6E000-0xBEB6FFFF 0 new_SP 0xBEB6E490 old_SP 0xBEB6E4CC current_stack 0x4630000-0x482EFFF 1 new_SP 0x482C4A4 old_SP 0x482C4A4 ... Eventually, the SIGCHLD is delivered: current_stack 0x4630000-0x482EFFF 1 new_SP 0xBEB6E6B0 old_SP 0xBEB6E6B0 new current_stack 0xBEB6E000-0xBEB6FFFF 0 + sigframe_create prev eip 0x40010b2 + mc_new_mem_stack_signal 1911 0xbeb6dfd0 0x6d8 + sigframe_create esp 0xbeb6dfd0 top esp 0xbeb6e6b0 size 0x6e0 pushed signal frame; %ESP now = 0xbeb6dfd0, next %EIP = 0x806a110, status=2 current_stack 0xBEB6DFD0-0xBEB6FFFF 0 new_SP 0xBEB6DF44 old_SP 0xBEB6DFC0 new current_stack not found The messages about "new current_stack not found" continue, presumably because the signal frame is not part of the recognised stack area for the main thread. While handing the signal, at some point, the other thread runs: ... current_stack 0xBEB6DFD0-0xBEB6FFFF 0 new_SP 0xBEB6DCA0 old_SP 0xBEB6DC64 new current_stack not found current_stack 0xBEB6DFD0-0xBEB6FFFF 0 new_SP 0x482C410 old_SP 0x482C410 new current_stack 0x4630000-0x482EFFF 1 current_stack 0x4630000-0x482EFFF 1 new_SP 0x482C3D0 old_SP 0x482C40C ... And some time after that, execution returns to the main thread, but it seems that the current_stack remains bound to the other thread and the "Invalid write" error follows: ... current_stack 0x4630000-0x482EFFF 1 new_SP 0x482C628 old_SP 0x482C63C current_stack 0x4630000-0x482EFFF 1 new_SP 0xBEB6DC80 old_SP 0xBEB6DC80 new current_stack not found current_stack 0x4630000-0x482EFFF 1 new_SP 0xBEB6DCA0 old_SP 0xBEB6DC84 new current_stack not found ... current_stack 0x4630000-0x482EFFF 1 new_SP 0xBEB6DFC0 old_SP 0xBEB6DF44 new current_stack not found + sigframe_destroy isRT 0 signo 17 esp 0xbeb6dfd0 size 0x6d8 redzone 0 + calling from track_die_mem_stack_signal 1845 0xbeb6dfd0 0x6d8 + set_address_range_perms 1738 1846 0xbeb6e680 len 40 smoff 7376 vabits16 0 current_stack 0x4630000-0x482EFFF 1 new_SP 0xBEB6E680 old_SP 0xBEB6E6AC ... ==25683== Invalid write of size 4 ==25683== at 0x806E19A: eventclockTime (timekeeping_.c:249) I suspect the key to the problem is that the current_stack no longer tracks the main thread after the other thread runs while signal delivery is still occurring in the main thread. Is this a plausible explanation? The signal frame itself seems to also cause "new current_stack not found" to be reported when handling. Should stack tracking know about signal frames? Earl |