[Valgrind-developers] Helgrind and strange races

Hello developers,

I am seeking advice on the following problem.
I am currently porting Helgrind to Solaris (as part of valgrind-solaris [1]).
I have defined all pthread and non-pthread intercepts,
and Helgrind seems to report real races as it should.
However, it also reports many false positives, which I tracked down
to stack variables being wrongly reported as accessed by two (or more) threads.

First some background about Solaris.
Solaris has its own libc and, compared with for example
GNU libc (used on Linux), threaded programs can exhibit
different scheduling behaviour.

The following uses a simple test program, drd/tests/atomic_var,
which creates two threads that read and write a global
variable without any synchronization.
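
For readers without the Valgrind tree at hand, the access pattern can be sketched roughly as below. This is not the source of drd/tests/atomic_var; the names shared_var, observed, writer, reader and run_unsynchronized_pair are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical global that both threads touch without any locking. */
static int shared_var = 0;
/* Written only by the reader thread; inspected by the caller after the joins. */
static int observed = -1;

static void *writer(void *arg)
{
    (void)arg;
    shared_var = 1;          /* unsynchronized write */
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    observed = shared_var;   /* unsynchronized read, races with writer() */
    return NULL;
}

/* Creates both threads, joins them, and returns 0 on success. */
int run_unsynchronized_pair(void)
{
    pthread_t t1, t2;

    if (pthread_create(&t1, NULL, writer, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, reader, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    if (pthread_join(t1, NULL) != 0 || pthread_join(t2, NULL) != 0)
        return -1;

    /* The joins order the write before this check. */
    assert(shared_var == 1);
    return 0;
}
```

Calling run_unsynchronized_pair() from main() and running the result under a race detector should flag the accesses to shared_var; whether the reader sees 0 or 1 depends on scheduling.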

On Linux, the typical scheduling (on a machine with one CPU) is:
1. Main thread running
2. Main thread calling pthread_create(t1) [thread t1 created]
3. Main thread calling pthread_create(t2) [thread t2 created]
4. Main thread calling pthread_join()
5. Thread t1 running and exiting
6. Thread t2 running and exiting
7. Main thread joining t1 and t2

On Solaris, however, I observe in this case:
1. Main thread running
2. Main thread calling pthread_create(t1) [thread t1 created]
3. Thread t1 running and exiting
4. Main thread calling pthread_create(t2) [thread t2 created]
5. Thread t2 running and exiting
6. Main thread calling pthread_join()
7. Main thread joining t1 and t2

Because threads t1 and t2 (in this example) run serially
(but with no synchronization between them!), they also get
the same stack; that is, the stack from thread t1 is reused for thread t2.
That is not the case on Linux, where the two threads get two different stacks.
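
One way to check for stack reuse outside of Valgrind is to have each thread record the address of one of its local variables. The sketch below is an assumption-laden illustration, not code from the port: record_stack_addr and sample_serial_thread_stacks are hypothetical names, and pthread_join() is used here to force the threads to run one after the other, whereas in the Solaris case above the serial order comes from scheduling alone.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Thread body: store the address of a local variable, that is, a
 * location on this thread's stack, into the slot passed via arg. */
static void *record_stack_addr(void *arg)
{
    int local = 0;
    *(uintptr_t *)arg = (uintptr_t)(void *)&local;
    return NULL;
}

/* Runs two threads strictly one after the other and reports the stack
 * address each of them saw. Returns 0 on success, -1 on failure. */
int sample_serial_thread_stacks(uintptr_t *first, uintptr_t *second)
{
    pthread_t t;

    assert(first != NULL && second != NULL);

    if (pthread_create(&t, NULL, record_stack_addr, first) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;

    /* The first thread has exited, so the threading library is free to
     * hand its stack to the next thread (it is not required to). */
    if (pthread_create(&t, NULL, record_stack_addr, second) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;

    return 0;
}
```

If the two recorded addresses are equal, the second thread was given the first thread's stack. Whether that happens depends on the libc and its stack caching policy, so no particular result is claimed here.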

For a reason I do not yet understand, Helgrind then reports accesses
to such stack variables as races, which are false positives.
One such false report:
==4432== Possible data race during read of size 8 at 0x7FFC5FF70 by thread #3
==4432== Locks held: none
==4432==    at 0x7FFF5C9BE: mythread_wrapper (hg_intercepts.c:367)
==4432==    by 0x7FFECFC5F: _thrp_setup (in /lib/amd64/libc.so.1)
==4432==    by 0x7FFECFF3F: ??? (in /lib/amd64/libc.so.1)
==4432==
==4432== This conflicts with a previous write of size 8 by thread #2
==4432== Locks held: none
==4432==    at 0x7FFEB6C50: set_cancel_pending_flag (in /lib/amd64/libc.so.1)
==4432==  Address 0x7ffc5ff70 is on thread #3's stack
==4432==  in frame #0, created by mythread_wrapper (hg_intercepts.c:342)

After enabling tracing in hg_main.c, I can confirm that
address 0x7ffc5ff70 first belonged to thread #2 and that, when that thread
exited, the same stack was assigned to thread #3:
evh__pre_thread_ll_create(p=1, c=2)    [Thread #2 is created]
evh__new_mem_stack(0x7FFC5FF70, 8)
evh__die_mem(0x7FFA62000, 2088960)   [stack killed]
evh__pre_thread_ll_exit(thr=2)
evh__pre_thread_ll_create(p=1, c=2)     [this is actually Thread #3]
evh__new_mem_stack(0x7FFC5FF70, 8)
evh__die_mem(0x7FFA62000, 2088960)   [stack killed]
evh__pre_thread_ll_exit(thr=2)

I looked at how Helgrind handles malloc/free, and it seems to me
that ultimately the same shadow_mem_make_NoAccess_NoFX()
is called as for the thread stack.
I also read the "avoid memory recycling" paragraph in [2], but
it is unclear to me whether that also applies to thread stacks.

How can I work out why Helgrind thinks there is a race here?
What kind of tracing do I need to enable to obtain the necessary information?
I am familiar with the code in hg_intercepts.c and hg_main.c but have not
studied libhb...

Kind regards,
Ivo Raisr

[1] https://bitbucket.org/setupji/valgrind-solaris
[2] http://www.valgrind.org/docs/manual/hg-manual.html#hg-manual.effective-use