From: Ivo R. <iv...@iv...> - 2015-02-03 17:53:45
Hello developers,

I am seeking advice on the following problem. I am currently porting Helgrind to Solaris (part of valgrind-solaris [1]). I have described all pthread and non-pthread intercepts, and Helgrind seems to be reporting races as it should. But it also reports many false positives, which I tracked down to a stack variable being falsely reported as accessed by two (or more) threads.

First some background about Solaris. Solaris has its own libc, and compared, for example, to GNU libc (used on Linux), threaded programs can exhibit different scheduling behaviour. The following uses a simple test program, drd/tests/atomic_var, which creates two threads; these threads read/write a global variable without any synchronization.

On Linux, typical scheduling (on a machine with one CPU) is:

1. Main thread running
2. Main thread calling pthread_create(t1) [thread t1 created]
3. Main thread calling pthread_create(t2) [thread t2 created]
4. Main thread calling pthread_join()
5. Thread t1 running and exiting
6. Thread t2 running and exiting
7. Main thread joining t1 and t2

While on Solaris I observe in this case:

1. Main thread running
2. Main thread calling pthread_create(t1) [thread t1 created]
3. Thread t1 running and exiting
4. Main thread calling pthread_create(t2) [thread t2 created]
5. Thread t2 running and exiting
6. Main thread calling pthread_join()
7. Main thread joining t1 and t2

Because threads t1 and t2 run serially in this example (even though there is no synchronization between them!), they also get the same stack; that is, the stack of thread t1 is reused for thread t2. That is not the case on Linux, where the threads get two different stacks. And Helgrind, for a reason I have not yet found, reports races on all such stack variables.
One such false report:

==4432== Possible data race during read of size 8 at 0x7FFC5FF70 by thread #3
==4432== Locks held: none
==4432==    at 0x7FFF5C9BE: mythread_wrapper (hg_intercepts.c:367)
==4432==    by 0x7FFECFC5F: _thrp_setup (in /lib/amd64/libc.so.1)
==4432==    by 0x7FFECFF3F: ??? (in /lib/amd64/libc.so.1)
==4432==
==4432== This conflicts with a previous write of size 8 by thread #2
==4432== Locks held: none
==4432==    at 0x7FFEB6C50: set_cancel_pending_flag (in /lib/amd64/libc.so.1)
==4432== Address 0x7ffc5ff70 is on thread #3's stack
==4432== in frame #0, created by mythread_wrapper (hg_intercepts.c:342)

After enabling tracing in hg_main.c, I can confirm that address 0x7ffc5ff70 first belonged to thread #2 and that, when that thread exited, the same stack got assigned to thread #3:

evh__pre_thread_ll_create(p=1, c=2)  [thread #2 is created]
evh__new_mem_stack(0x7FFC5FF70, 8)
evh__die_mem(0x7FFA62000, 2088960)   [stack killed]
evh__pre_thread_ll_exit(thr=2)
evh__pre_thread_ll_create(p=1, c=2)  [this is actually thread #3]
evh__new_mem_stack(0x7FFC5FF70, 8)
evh__die_mem(0x7FFA62000, 2088960)   [stack killed]
evh__pre_thread_ll_exit(thr=2)

I observed how Helgrind handles malloc/free, and it seems to me that ultimately the same shadow_mem_make_NoAccess_NoFX() is called as for the thread stack. I also read the "avoid memory recycling" paragraph in [2], but it is unclear to me whether that also applies to thread stacks.

How can I work out why Helgrind thinks there is a race here? What kind of tracing do I need to enable to obtain the necessary information? I am familiar with the code in hg_intercepts.c and hg_main.c, but have not studied libhb...

Kind regards,
Ivo Raisr

[1] https://bitbucket.org/setupji/valgrind-solaris
[2] http://www.valgrind.org/docs/manual/hg-manual.html#hg-manual.effective-use
From: Philippe W. <phi...@sk...> - 2015-02-04 21:16:57
|
On Tue, 2015-02-03 at 18:53 +0100, Ivo Raisr wrote:
> That is not the case on Linux, where the threads get two different
> stacks.
>
> And Helgrind, for some unknown reason, reports races on all such
> stack variables.

I encountered similar problems some months ago on Linux, due to the 'thread stack cache' implemented by glibc.

If a thread stack is really released, Helgrind detects this and 'clears this memory' (typically because the stack is munmapped). But due to the glibc thread stack cache, the stack of one thread can be re-used by another thread. Helgrind might then report false race conditions, because no synchronisation is observed between these two threads, while it looks like they touch the same memory.

IIRC, DRD solves this in a special way, by detecting memory mmapped or allocated during pthread_create (but that can introduce false negatives, IIRC).

For Helgrind, I implemented a --sim-hint (i.e. a hack) to disable the glibc stack cache. See
http://www.valgrind.org/docs/manual/manual-core.html#manual-core.rareopts
--sim-hints no-nptl-pthread-stackcache

Maybe the problem is that Helgrind on Solaris is similarly not informed that the stack memory is to be 'cleared' when re-used by another thread?

Philippe
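For reference, that hint is passed on the Valgrind command line like this (the program name is just an example):

```shell
# Disable glibc's NPTL thread stack cache under Helgrind, so a dead
# thread's stack is not silently handed to the next thread:
valgrind --tool=helgrind --sim-hints=no-nptl-pthread-stackcache ./atomic_var
```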
From: Ivo R. <iv...@iv...> - 2015-02-06 05:20:41
2015-02-04 22:18 GMT+01:00 Philippe Waroquiers <phi...@sk...>:

> If a thread stack is really released, helgrind detects this
> and 'clears this memory' (typically because the stack is munmapped).
> But due to the glibc thread stack cache, the stack of one thread
> could be re-used by another thread. Then helgrind might report
> false race conditions, because no synchronisation is observed between
> these 2 threads, while it looks like they touch the same memory.

Hello Philippe,

Thank you for confirming my suspicions. That is what happens here.

> IIRC, drd solves this a special way by detecting memory mmaped or
> allocated during pthread_create (but that can introduce false negative
> IIRC).

Indeed, DRD solves this as follows in drd_start_using_mem() (drd/drd_main.c:349-352). I had not paid any particular attention to this before:

   if (UNLIKELY(DRD_(running_thread_inside_pthread_create)()))
   {
      DRD_(start_suppression)(a1, a2, "pthread_create()");
   }

> For helgrind, I implemented a --sim-hint (i.e. a hack) to disable
> the glibc stack cache. See
> http://www.valgrind.org/docs/manual/manual-core.html#manual-core.rareopts
> --sim-hints no-nptl-pthread-stackcache

Unfortunately there is no such easy hack to disable the stack cache on Solaris. I also discovered that additional objects are cached as well. I will probably need to intercept/wrap the corresponding functions and explicitly clear the objects before they are returned for use.

Thanks again for your insights!
Ivosh
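That wrapping could take roughly the following shape (a sketch under assumptions: cache_take() is a hypothetical stand-in for whatever libc routine hands out a cached object, not a real Solaris function; VALGRIND_HG_CLEAN_MEMORY from valgrind/helgrind.h is the client request that erases Helgrind's shadow state for a range, and it is stubbed to a no-op here when the header is unavailable):

```c
#include <stddef.h>

/* Use the real Helgrind client request when the Valgrind headers are
   available; otherwise compile it away so the sketch still builds. */
#if defined(__has_include)
# if __has_include(<valgrind/helgrind.h>)
#  include <valgrind/helgrind.h>
# endif
#endif
#ifndef VALGRIND_HG_CLEAN_MEMORY
# define VALGRIND_HG_CLEAN_MEMORY(addr, len) ((void)(addr), (void)(len))
#endif

/* Hypothetical wrapper: before a cached object (a recycled thread
   stack, etc.) is handed back out for reuse, tell Helgrind that its
   old access history is dead, so accesses by the new owner are not
   matched against the previous owner's accesses. */
void *cache_take(void *cached_obj, size_t len)
{
    if (cached_obj != NULL)
        VALGRIND_HG_CLEAN_MEMORY(cached_obj, len);
    return cached_obj;
}
```

The same idea could live in an interceptor in hg_intercepts.c, invoked just before the cached object escapes libc.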