|
From: Jeremy F. <je...@go...> - 2002-10-07 00:14:47
|
Can go into a few more details about what's wrong with helgrind? Does
it get the Eraser algorithm wrong, or does it just need some more
refinement?
I read the Eraser paper and the extension suggested in "Visual Threads",
and it all looks pretty straightforward. Once I get some profiling
machinery going, I'm definitely interested in working on helgrind.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-10-07 10:10:37
|
On 6 Oct 2002, Jeremy Fitzhardinge wrote: > Can go into a few more details about what's wrong with helgrind? Does > it get the Eraser algorithm wrong, or does it just need some more > refinement? > > I read the Eraser paper and the extension suggested in "Visual Threads", > and it all looks pretty straightforward. Once I get some profiling > machinery going, I'm definitely interested in working on helgrind. That would be really good. It's been a while since I looked at it. My first attempt at the lockset stuff was very simple and probably wasn't going to scale at all well, but then Julian improved the lockset stuff a bit (a lot?) so that might not be such a problem yet. I think the problem is more that I never tried it on any program larger than about 20 lines. It could find easy data races in these short programs. But it also reported quite a few races in glibc. One of these we found to be genuine (an unsafe update of a totally trivial statistics counter only printed for debugging), but for the others it was unclear whether they were genuine or if Helgrind was wrong. I certainly tried to implement the Eraser algorithm faithfully. So to summarise: it probably needs to be robustified to handle large programs, and it hasn't been verified as to whether it's finding real races yet. N |
|
From: Julian S. <js...@ac...> - 2002-10-09 18:40:48
|
Something is seriously snafu'd with sourceforge mailing lists; Nick sent this at 11am monday and it arrived here sometime Weds afternoon. > > and it all looks pretty straightforward. Once I get some profiling > > machinery going, I'm definitely interested in working on helgrind. > > That would be really good. It's been a while since I looked at it. My > first attempt at the lockset stuff was very simple and probably wasn't > going to scale at all well, but then Julian improved the lockset stuff a > bit (a lot?) so that might not be such a problem yet. Yes, please hack on! Re my changes, all I did was debug the set operations, so it's stable on large programs. I also and restructured hg_main.c a but, but I didn't change the efficiency of it at all. cvs log of hg_main.c should tell you more. J |
|
From: Julian S. <js...@ac...> - 2002-10-07 17:53:47
|
On Monday 07 October 2002 1:14 am, Jeremy Fitzhardinge wrote: > Can go into a few more details about what's wrong with helgrind? Does > it get the Eraser algorithm wrong, or does it just need some more > refinement? > > I read the Eraser paper and the extension suggested in "Visual Threads", > and it all looks pretty straightforward. Once I get some profiling > machinery going, I'm definitely interested in working on helgrind. It generates bazillions of errors. We don't know why. The only _known_ cause for tyhis is that glibc has a counter for the number of relocations done by the dynamic linker, and this is incremented without locking. That's allegedly harmless because it's only for stats purposes, but it's still something that needs to be suippressed or worked around. Problem with suppression is that there is no fixed call stack at the error point; it happens at the first use of every dynamic symbol (or something) so the normal stack-based-identification of the suppression mechanism won't work. We haven't thought of a way round this yet. If you want to investigate that would be great. Having a usable race detector would be really excellent. J |
|
From: Jeremy F. <je...@go...> - 2002-10-08 01:09:56
|
On Mon, 2002-10-07 at 11:00, Julian Seward wrote:
On Monday 07 October 2002 1:14 am, Jeremy Fitzhardinge wrote:
> Can go into a few more details about what's wrong with helgrind? Does
> it get the Eraser algorithm wrong, or does it just need some more
> refinement?
>
> I read the Eraser paper and the extension suggested in "Visual Threads",
> and it all looks pretty straightforward. Once I get some profiling
> machinery going, I'm definitely interested in working on helgrind.
It generates bazillions of errors. We don't know why.
The only _known_ cause for tyhis is that glibc has a counter for
the number of relocations done by the dynamic linker, and this is
incremented without locking. That's allegedly harmless because
it's only for stats purposes, but it's still something that needs
to be suippressed or worked around. Problem with suppression is
that there is no fixed call stack at the error point; it happens
at the first use of every dynamic symbol (or something) so the
normal stack-based-identification of the suppression mechanism
won't work. We haven't thought of a way round this yet.
I had a play with this last night, and noticed the dynamic linker count
thing. Several things occurred to me:
* Obviously, helgrind needs to have suppression implemented. I
suspect it needs dynamic suppression (hints inserted in the
code) as well as static suppression rules.
* I noticed that the symtab stuff only collects function
symbols, but we're really going to want static and global data
symbols as well, so they can be used for both reports and
suppressions.
* Also, a fair amount of the memcheck error report stuff needs
to be reused as well, so dynamic memory can be identified in
terms of its allocation site.
* I think the algorithm needs a new state, which means "can be
touched by any thread for any reason". This would mainly be
for error suppression, so that you can annotate it that way,
and so that once you've reported that location you can silence
further errors. Perhaps you could use "exclusive" state with
a magic thread ID meaning "everyone".
* The Eraser paper didn't cover the case where a mutex identity
can be reused (say the mutex is part of an object in dynamic
memory which is reallocated). I'm guessing that on freeing
some memory, you have to go through all the mutex sets looking
for a mutex in the freed memory range, and poison that set
somehow.
* The Eraser paper also used a resolution of 4 bytes because
that's the smallest the alpha guarantees for atomic
operations. The x86 goes down to 1 byte; I wonder if we need
more resolution (ie, are there neighbouring memory locations
being updated by different threads, which are causing spurious
errors?).
J
|