|
From: Konstantin S. <kon...@gm...> - 2008-03-28 11:44:48
|
Hi,
I'd like to collect ideas regarding the subject raised today in a
separate thread: how to decipher Helgrind's reports about 'Possible
data race'.
So, this is the usual format of a Helgrind report about a race:
- ACCESS_TYPE (read or write)
- memory address ADDR
- thread segment SEG
- thread THR
- stack dump of access ACCESS_CONTEXT
- stack dump of the place where ADDR has been allocated:
ALLOC_CONTEXT (or a name of a global variable).
- stack dump of the place where the last consistently used lock was
acquired: LOCK_CONTEXT
- Previous state OLD_STATE, which indicates:
- whether this memory was previously written, or only read: W or R
- in which segments and threads those previous accesses happened
(like this: S123/T1 S456/T3 S987/T7)
So, if the race happens on a global var, life is easy: we just check
all uses of this var manually.
If there are too many uses, we can run Helgrind a second time with
--trace-addr=ADDR --trace-level=2 and get all the accesses.
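The two-step workflow might look like this on the command line (the program name and the address are placeholders; the flags are the ones named above):

```shell
# First run: get the race report and note the address ADDR it mentions.
valgrind --tool=helgrind ./myapp

# Second run (hypothetical address shown): log every access to ADDR.
valgrind --tool=helgrind --trace-addr=0x5a1c040 --trace-level=2 ./myapp
```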
If the race happens on a memory location allocated from the heap,
e.g. a field of a structure inside our code, --trace-addr may
not work (in my experience it never works on big apps).
This is because addresses allocated in multi-threaded programs differ
from run to run (idea: hack the allocator to make it more
reproducible; not sure if possible).
In this case VG_USERREQ__HG_TRACE_MEM is useful: we annotate the racy
field with this client request and rerun Helgrind with --trace-level=2
getting all the accesses.
In my experience it helps in ~50% of cases.
I think that sometimes printing the traces perturbs the scheduler and
the race gets hidden (idea: instead of printing traces, store them
somewhere and print them only when showing the race).
Ok, but what shall we do if the race is inside some library code (e.g.
STL)? We can't annotate it...
That's what I do (not perfect and requires a lot of manual work):
- On each segment creation I record the current context (stack dump)
(I added an ExeContext* field to Segment)
- When printing a race report I also print the contexts of all
segments in the OLD_STATE.
It gives me information like this: access to ADDR in thread T1
happened after context C1, access in T2 happened after C2, ...
Usually, C1 and C2 are quite far from the actual access :(
But now I can find the actual access by creating new segments in
random parts of code starting from C1 and C2. (the new segments can be
created by annotating the code with
_VG_USERREQ__HG_PTHREAD_COND_SIGNAL_PRE(0xDEADBEAF))
A long process I should say... Just yesterday I spent 1.5 hours trying
to understand a particularly nasty race reported inside vector<>.
Does anyone have a better idea?
--kcc
P.S. Julian, the mail to you still bounces:
----- Transcript of session follows -----
.. while talking to open-works.net
>>> DATA
<<< 554 5.7.1 Penalty Box error, please contact the server support to
ensure delivery
|
|
From: Julian S. <js...@ac...> - 2008-03-28 23:30:20
|
Konstantin wrote:

> I'd like to collect ideas regarding the subject raised today in a
> separate thread: how to decipher Helgrind's reports about 'Possible
> data race'.

That is an excellent question; unfortunately not easy to answer.

Let me ask a related question. In a way it is chasing the problem
from the other end. Question is: In an ideal world (no constraints
on CPU time or memory), what information would make it easy to
find the root cause of race reports?

J
|
|
From: Bart V. A. <bar...@gm...> - 2008-03-29 12:39:39
|
On Sat, Mar 29, 2008 at 12:26 AM, Julian Seward <js...@ac...> wrote:
>
> Konstantin wrote:
>
> > I'd like to collect ideas regarding the subject raised today in a
> > separate thread: how to decipher Helgrind's reports about 'Possible
> > data race'.
>
> That is an excellent question; unfortunately not easy to answer.
>
> Let me ask a related question. In a way it is chasing the problem
> from the other end. Question is: In an ideal world (no constraints
> on CPU time or memory), what information would make it easy to
> find the root cause of race reports?

What definitely helps is the stack traces (two or more) of all
conflicting accesses and the allocation context of the address on
which the conflict happened. This is sufficient for identifying the
source code statements causing the conflict.

Solving a data race properly can be more difficult. Sometimes basic
knowledge of the software you are analyzing is sufficient, sometimes
you need a deep understanding of the software. Adding more tracing
always helps understanding complex cases.

Bart.
|
|
From: Konstantin S. <kon...@gm...> - 2008-03-29 07:14:52
|
> > I'd like to collect ideas regarding the subject raised today in a
> > separate thread: how to decipher Helgrind's reports about 'Possible
> > data race'.
>
> That is an excellent question; unfortunately not easy to answer.
>
> Let me ask a related question.
> In a way it is chasing the problem from the other end.
Right.
It's like unmincing forcemeat (my translation of a Russian idiom :))
> Question is: In an ideal world (no constraints
> on CPU time or memory), what information would make it easy to
> find the root cause of race reports?
I think this:
1: Record each memory access and allocation with
a) Stack trace
b) Thread and segment
c) List of held locks with stack traces of last acquisitions.
2: Complete happens-before graph where each segment is attributed with
stack trace of its beginning.
The second is easy, but less important.
The first is easy if we know the address to trace (with --trace-addr)
before starting Helgrind. But we usually don't. :(
--kcc
|
|
From: Julian S. <js...@ac...> - 2008-03-29 13:33:09
|
> First is easy if we know the address to trace (with --trace-addr)
> before we start Helgrind. But we usually don't. :(

Would this help?

Once an address is marked as SHVAL_Race, start collecting more info
about it. In particular, record stack traces for the next N (say 100)
memory accesses to it, or maybe better a stack trace for the first
access to it from each different thread. Or some variant of these.

This doesn't help find the first access to an address in a race. But
on the assumption that most races happen > 1 time (as reported in the
Rodeheffer "RaceTrack" paper) then this might be an easy way to find
some other places where the address is accessed.

J
|
|
From: Konstantin S. <kon...@gm...> - 2008-03-29 16:42:24
|
On Sat, Mar 29, 2008 at 4:28 PM, Julian Seward <js...@ac...> wrote:
>
> > First is easy if we know the address to trace (with --trace-addr)
> > before we start Helgrind. But we usually don't. :(
>
> Would this help?
>
> Once an address is marked as SHVAL_Race, start collecting more info
> about it. In particular, record stack traces for the next N (say 100)
> memory accesses to it, or maybe better a stack trace for the first access
> to it from each different thread. Or some variant of these.
>
> This doesn't help find the first access to an address in a race. But
> on the assumption that most races happen > 1 time (as reported in the
> Rodeheffer "RaceTrack" paper) then this might be an easy way to find
> some other places where the address is accessed.
Yes, it's worth trying.
Maybe, something like this: once a race is detected and reported (i.e.
not a suppressed error), put the memory in SHVAL_New_Traced state.
And when handling all other accesses apply the same state machine, but
instead of states SHVAL_R and SHVAL_W, use SHVAL_{R,W}_Traced (and, of
course, print or collect the traces). If a race is detected again,
print a report saying that we have all the traces now.
Will do experiments next week.
This will hardly be a complete solution though.
Just an example from my last week's runs: I've run some test under
Helgrind ~20 times.
There were two reports that made me curious: one appeared in every run
and another only appeared 2-3 times.
I chased the first one with techniques described in my first message.
That race happens > 1 time.
But I still did not catch the second race and I suspect it happens
only once (it's in a destructor).
--kcc
|
|
From: Julian S. <js...@ac...> - 2008-03-29 18:58:32
|
> Yes, it's worth trying.
> [...]
> This will hardly be a complete solution though.
> Just an example from my last week's runs: I've run some test under
> Helgrind ~20 times.
> There were two reports that made me curious: one appeared in every run
> and another only appeared 2-3 times.
Yes. I agree it is not the ideal solution. But (obviously) the problem
is that to collect all this information for all memory locations is
impossibly expensive, so restricting it to cases where we have seen a
race is a good filtering heuristic. (maybe not good enough)
Even if it can be only a 90% solution, it is much better than what we
have right now. FWIW, on Friday I spent some time with HGDEV chasing
a race in the new OpenOffice 2.4, and ... it is very difficult to make
sense of the results. Basically I gave up. ("yes, ok, I agree, there
is a race here. but where did the access(es) from other threads happen?")
Although it is true, OOo is not exactly a simple or small program :-)
J
|
|
From: Bart V. A. <bar...@gm...> - 2008-03-30 10:12:29
|
On Sat, Mar 29, 2008 at 8:54 PM, Julian Seward <js...@ac...> wrote:
>
> Even if it can be only a 90% solution, it is much better than what we
> have right now. FWIW, on Friday I spent some time with HGDEV chasing
> a race in the new OpenOffice 2.4, and .. it is very difficult to make
> sense of the results. Basically I gave up. ("yes, ok, I agree, there
> is a race here. but where did the access(es) from other threads happen?")
> Although it is true, OOo is not exactly a simple or small program :-)
If it is not immediately clear from a race report what the cause of a
data race is, it can help to insert a client request in the source
code that tells Helgrind or DRD to trace all accesses to the offending
memory location. The current approach -- one set of client requests
for Helgrind and another set of client requests for DRD -- is
inconvenient for Valgrind users. In my opinion it would be a great
service to Valgrind users if both Helgrind and DRD would use the same
client requests for e.g. tracing memory locations and suppressing race
reports, such that client code has to be instrumented only once and
such that Valgrind users can easily switch between the two tools.
And how about client requests for informing Valgrind tools about
custom memory allocator actions? It should be sufficient if Valgrind
users instrument memory allocator code once with the
VG_USERREQ__*MEMPOOL* requests defined in <valgrind.h>. Should both
DRD and Helgrind implement support for these client requests, or is it
possible to move the mempool support code to the Valgrind core ?
Bart.
|