|
From: Konstantin S. <kon...@gm...> - 2008-02-19 14:57:19
|
Hello all,
I've been doing some more experiments with helgrind and I would like
to share some ideas...
The 'test' programs I've run have a few things in common:
- Hundreds of threads.
- Heavy use of message queues, condvars, and other 'happens-before'
stuff. Mutexes are used as well.
- Strict timeouts (i.e. if some action is delayed too much, the
program starts behaving differently).
- Strict limits on memory usage.
Accuracy:
The current default memory state machine (MSMHelgrind) gives too many
false reports in the presence of message queues, condvars, etc.
So far, I find MSMProp1 to be more accurate, but noticeably slower
(mostly because it has to check happens-before in all states).
Speed:
MSMHelgrind leads to 40x-150x slowdown (mostly depending on the number
of active threads, since valgrind runs on one core).
MSMProp1 is 20%-100% slower than MSMHelgrind (depending on the size of
HB graph?).
Even though it is possible to further speed up both machines, I don't
think we can get to a 20x slowdown by doing just this.
If we can ignore the majority of all memory accesses, we will speed up
the whole thing.
I have implemented the following:
if ((address % X) != Y) we ignore this memory access.
X and Y are command line parameters (% operation is optimized for
some values).
0<=Y<X
typical value of X is between 3 and 30.
This hack brings helgrind's slowdown to about 10x-15x and makes
my strict timeouts happy.
We will of course miss some races, but it is better than nothing
(again, this is needed only in the presence of strict timeouts).
And we also can run helgrind several times giving different values of Y.
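As a rough sketch of this filter (the names here are mine, not helgrind's), with the `%` operation reducible to a mask when X is a power of two, which is one way it can be optimized for some values:

```c
#include <assert.h>

/* Illustrative sketch of the sampling hack described above; function
   and variable names are hypothetical.  An access to 'addr' reaches
   the state machine only when (addr % X) == Y, so roughly 1/X of all
   accesses are examined.  X and Y come from the command line, with
   0 <= Y < X. */
static unsigned long filter_X = 7;   /* typical X is between 3 and 30 */
static unsigned long filter_Y = 3;

static int should_check_access(unsigned long addr)
{
   /* When X is a power of two, the '%' reduces to a bit mask. */
   if ((filter_X & (filter_X - 1)) == 0)
      return (addr & (filter_X - 1)) == filter_Y;
   return addr % filter_X == filter_Y;
}
```

Running helgrind X times, once for each Y in 0..X-1, covers every address, as the text notes.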
Most other ways to speed up helgrind are complementary to this hack.
Memory:
Helgrind's memory consumption on my tests is about 2x-2.5x.
Some tests have limits on memory usage and I have to disable those
limits (and use a machine with lots of RAM).
Also, if I modify the hack above to
if (((address >> N_SECMAP_BITS) % X) != Y)
I reduce the memory usage (since shadow values are not created for the
ignored addresses).
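A sketch of this coarser variant; filtering on the secondary-map index means whole secmap-sized chunks are either checked or ignored, so no shadow values are created for the ignored chunks. The value 13 for N_SECMAP_BITS is an assumption for illustration only:

```c
#include <assert.h>

/* Coarser variant of the sampling filter: decide per secondary-map
   chunk rather than per address, so shadow memory is never allocated
   for ignored chunks.  13 is an assumed secmap granularity, chosen
   only for illustration. */
#define N_SECMAP_BITS 13

static int should_check_chunk(unsigned long addr,
                              unsigned long X, unsigned long Y)
{
   return ((addr >> N_SECMAP_BITS) % X) == Y;
}
```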
Segments:
If we have a lot of happens-before synchronization we create many segments.
We need to have very fast mapping between SegmentID->Segment
(especially for MSMProp1).
I suggest a resizable array of segments instead of a linked list.
And the mapping SegmentID->Segment is just an array reference. It also
reduces memory usage a bit.
We can't have more than N segments at a time (limited by RAM) so
ideally we have to recycle old segments (I did not do this yet).
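A minimal sketch of the suggested mapping, with a placeholder Segment type: the SegmentID is simply the index into a growable array, so the SegmentID->Segment lookup is a single array reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of a resizable SegmentID -> Segment mapping.  The Segment
   type and all names are placeholders; a real implementation would
   also handle allocation failure and segment recycling. */
typedef struct { int dummy; } Segment;

static Segment** seg_array = NULL;
static unsigned  seg_size  = 0;     /* slots allocated */
static unsigned  seg_used  = 0;     /* next free SegmentID */

static unsigned new_segment(Segment* s)
{
   if (seg_used == seg_size) {
      seg_size  = seg_size ? 2 * seg_size : 1024;
      seg_array = realloc(seg_array, seg_size * sizeof(Segment*));
   }
   seg_array[seg_used] = s;
   return seg_used++;               /* the new SegmentID */
}

/* SegmentID -> Segment is just an array reference. */
static Segment* get_segment(unsigned id) { return seg_array[id]; }
```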
Annotations:
Life is more complex than just mutex and condvar. :)
We have all the varieties of message queues and custom synchronization
mechanisms implemented via atomics.
Most of them can be explained to helgrind via annotations. Annotations
are essentially valgrind's 'client requests'.
This is what I used so far:
ANNOTATE_CONDVAR_*
Create a happens-before relation between two thread segments at
arbitrary points of the program.
ANNOTATE_PCQ_*
A variation of ANNOTATE_CONDVAR_* specifically for FIFO message queue.
ANNOTATE_MUTEX_IS_USED_AS_CONDVAR(mutex)
Signal on all Unlocks and Wait on all Locks of this mutex.
On such a mutex helgrind will behave as a pure happens-before detector. See test61.
ANNOTATE_TRACE_MEMORY
--trace-addr is very good for testing helgrind on unit tests.
However, it is useless on large programs where memory addresses are
non-deterministic (due to the scheduler).
This annotation tells helgrind to trace accesses to some particular
memory location and to report races only on this address.
Multiple addresses could be traced. Useful for debugging a race.
ANNOTATE_EXPECT_RACE
Useful for regression testing of helgrind itself (if an expected race
is not detected, helgrind will complain).
ANNOTATE_BENIGN_RACE
Alternative to a suppression file.
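A hedged usage sketch: the macro names follow the data-race-test project, but the no-op definitions here are stand-ins so the fragment is self-contained (a real build would include that project's annotation header, which expands these into valgrind client requests):

```c
/* Stand-in definitions so this sketch compiles on its own; in a real
   build these come from the data-race-test annotation header and
   expand to valgrind client requests. */
#define ANNOTATE_CONDVAR_SIGNAL(obj)     ((void)(obj))
#define ANNOTATE_CONDVAR_WAIT(obj)       ((void)(obj))
#define ANNOTATE_BENIGN_RACE(addr, desc) ((void)(addr))

static int flag = 0;   /* custom lock-free flag, for illustration only */

static void publisher(void)
{
   flag = 1;
   ANNOTATE_CONDVAR_SIGNAL(&flag);  /* happens-before edge starts here */
}

static void consumer(void)
{
   while (flag == 0) { /* spin */ }
   ANNOTATE_CONDVAR_WAIT(&flag);    /* ...and ends here */
}
```

The signal/wait pair tells a happens-before detector that everything the publisher did before the signal is ordered before everything the consumer does after the wait, which is exactly the edge a plain spin-on-flag hides from it.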
Most of the things mentioned above are implemented at
http://code.google.com/p/data-race-test (not all of them are
'production quality' yet).
Your feedback is more than welcome.
Thanks,
--kcc
|
|
From: Nicholas N. <nj...@cs...> - 2008-02-19 21:09:39
|
On Tue, 19 Feb 2008, Konstantin Serebryany wrote:

> If we can ignore the majority of all memory accesses, we will speedup
> the whole thing.
> I have implemented the following:
> if ((address % X) != Y) we ignore this memory access.
> X and Y are command line parameters (% operation is optimized for
> some values).
> 0<=Y<X
> typical value of X is between 3 and 30.

So this increases the granularity by treating larger sections of
memory as a single thing?

More generally, a huge proportion of what Helgrind is doing is wasted
effort. In many multithreaded programs, different threads share only a
small amount of memory out of the total memory used. Unfortunately
there's no way for Helgrind to know in advance (AFAIK) what that small
amount of memory should be. Also, because threads are so crude, each
thread can accidentally "share" more memory than intended (ie. no
inter-thread memory protection).

But programmers should know in advance which bits of memory should be
shared. Perhaps some client requests which say "this section of memory
will be shared" or "this section of memory won't be shared" could be
useful. In the "won't be shared" sections the checking might be a lot
simpler?

Nick
|
|
From: Konstantin S. <kon...@gm...> - 2008-02-20 15:37:45
|
> So this increases the granularity by treating larger sections of
> memory as a single thing?

No. It just ignores a large portion of memory locations (thus missing
some races).
What you suggest will also lead to a speedup, but will report races
where there is actually no race, just likely false sharing.

> But programmers should know in advance which bits of memory should be
> shared. Perhaps some client requests could be used which say "this
> section of memory will be shared" or "this section of memory won't be
> shared" could be useful. In the "won't be shared" sections the
> checking might be a lot simpler?

Yes, I was also thinking about such requests. But in order to be
helpful they must cover a large portion of all accesses (dynamically).
Also, we will have to put all these addresses into some
map/hash_map/cache, which may also be expensive.

One more idea I am going to try:
Consider we want to debug only some part of a large program. Then we
can simply ignore all memory accesses done in other parts of the
program. Helgrind could have a command line option saying 'handle only
those accesses that happen when the execution stack contains FOO() or
BAR()'.

--kcc
|
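A minimal sketch of such an option, assuming the tool can walk the current guest call stack and knows each interesting function's address range from debug info (all names here are illustrative):

```c
#include <assert.h>

/* Sketch of the proposed filter: handle an access only if some frame
   of the current execution stack lies inside a user-named function.
   In a real tool the frame addresses would come from the guest stack
   and the ranges from debug info; here both are plain parameters. */
typedef struct { unsigned long lo, hi; } FnRange;

static int stack_contains(const unsigned long* frames, int n_frames,
                          const FnRange* fns, int n_fns)
{
   for (int i = 0; i < n_frames; i++)
      for (int j = 0; j < n_fns; j++)
         if (frames[i] >= fns[j].lo && frames[i] < fns[j].hi)
            return 1;  /* stack contains FOO() or BAR(): check access */
   return 0;           /* otherwise ignore the access */
}
```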
|
From: Julian S. <js...@ac...> - 2008-02-20 02:41:20
|
> I've been doing some more experiments with helgrind and I would like
> to share some ideas...

Good. Some comments, in no particular order:

> Accuracy:
>
> The current default memory state machine (MSMHelgrind) gives too many
> false reports in presence of message queues, condvars, etc.
> So far, I find MSMProp1 to be more accurate,

Improved accuracy can only be a good thing. Is MSMProp1 accurate
enough that you can find bugs in your real applications, without
getting too many false errors now?

> but noticeably slower
> (mostly because it has to check happens-before in all states).

Maybe some better caching of the happens-before queries would help?
The current implementation (hbefore__cache et al) uses a 64-entry
fully associative cache, in effect. Maybe it would be better to have
a larger, set-associative cache, e.g. 256 lines of 4 entries each.

For sure a lower miss rate on the cache is possible. Also, the miss
path -- function cmpGEQ_VTS -- is naively coded; we could do a lot
better there. Maybe it is possible to change the representation of
VTSs so that comparison using vector operations (SSE insns, etc) is
possible, or at least so that the complex alignment logic can be
avoided.

> Even though it is possible to further speedup both machines, I don't
> think we can get to 20x slowdown by doing just this.

I agree. I think we need all the tricks we can get.

> I have implemented the following:
> if ((address % X) != Y) we ignore this memory access.
> [...]
> This hack brings helgrind's speed to something about 10x-15x and makes
> my strict timeouts happy.

Good! Then you can make a better evaluation of MSMProp1.

> Most other ways to speedup helgrind are complimentary to this hack.

Yes. As I mentioned a couple of weeks back, I think we can use the
fact that the FSMs are mostly idempotent under certain circumstances,
to take advantage of Helgrind's single-threadedness, and so filter
out potentially many memory references.
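The larger set-associative happens-before cache suggested above might be shaped roughly like this (256 sets of 4 ways, keyed on a pair of segment IDs; the names, hash, and MRU replacement are illustrative assumptions, not helgrind's code):

```c
#include <assert.h>
#include <string.h>

/* Sketch of a set-associative happens-before cache: 256 sets of 4
   ways, keyed on (segA, segB).  On a miss the caller would fall back
   to the slow comparison (cmpGEQ_VTS) and then insert the answer. */
#define N_SETS 256
#define N_WAYS 4

typedef struct { unsigned segA, segB; int valid, result; } HBEntry;
static HBEntry hb_cache[N_SETS][N_WAYS];  /* zero-initialized: all invalid */

static unsigned hb_set(unsigned segA, unsigned segB)
{
   return (segA ^ (segB << 3)) % N_SETS;  /* toy hash for illustration */
}

static int hb_lookup(unsigned segA, unsigned segB, int* result)
{
   HBEntry* set = hb_cache[hb_set(segA, segB)];
   for (int i = 0; i < N_WAYS; i++) {
      if (set[i].valid && set[i].segA == segA && set[i].segB == segB) {
         *result = set[i].result;
         return 1;                        /* hit */
      }
   }
   return 0;                              /* miss: run cmpGEQ_VTS */
}

static void hb_insert(unsigned segA, unsigned segB, int result)
{
   HBEntry* set = hb_cache[hb_set(segA, segB)];
   /* naive replacement: shift ways down, install at way 0 (MRU) */
   memmove(&set[1], &set[0], (N_WAYS - 1) * sizeof(HBEntry));
   set[0] = (HBEntry){ .segA = segA, .segB = segB,
                       .valid = 1, .result = result };
}
```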
> We can't have more than N segments at a time (limited by RAM) so
> ideally we have to recycle old segments (I did not do this yet).

Segment recycling is important; otherwise large programs cannot run
for long without eating all memory. Progress on this would be good.

> Most of them can be explained to helgrind via annotations. Annotations
> are essentially valgrind's 'client requests'.
> This is what I used so far:
>
> ANNOTATE_CONDVAR_*
> Create a happens-before relation between two thread segments at
> arbitrary points of program.
>
> ANNOTATE_PCQ_*
> A variation of ANNOTATE_CONDVAR_* specifically for FIFO message queue.
>
> ANNOTATE_MUTEX_IS_USED_AS_CONDVAR(mutex)
> Signal on all Unlocks and Wait on all Locks of this mutex.
> On such mutex helgrind will behave as pure happens-before detector.
> See test61.
>
> ANNOTATE_TRACE_MEMORY
> --trace-addr is very good for testing helgrind on unit tests.
> However it is useless on large programs where memory addresses are
> non-deterministic (due to scheduler).
> This annotation tells helgrind to trace accesses to some particular
> memory location and to report races only on this address.
> Multiple addresses could be traced. Useful for debugging a race.
>
> ANNOTATE_EXPECT_RACE
> Useful for regression testing of helgrind itself (if an expected race
> is not detected, helgrind will complain).
>
> ANNOTATE_BENIGN_RACE
> Alternative to a suppression file.

These all sound good to me.

J
|
|
From: Bart V. A. <bar...@gm...> - 2008-02-20 07:15:29
|
On Feb 19, 2008 10:09 PM, Nicholas Nethercote <nj...@cs...> wrote:

> But programmers should know in advance which bits of memory should be
> shared. Perhaps some client requests could be used which say "this
> section of memory will be shared" or "this section of memory won't be
> shared" could be useful. In the "won't be shared" sections the
> checking might be a lot simpler?

Hello Nick,

Sounds nice, but a very important task for data race detection tools
is to detect which data is shared between threads unintentionally.

Bart Van Assche.
|
|
From: Ashley P. <api...@co...> - 2008-02-20 11:52:18
|
On Wed, 2008-02-20 at 08:15 +0100, Bart Van Assche wrote:
> On Feb 19, 2008 10:09 PM, Nicholas Nethercote <nj...@cs...> wrote:
> > But programmers should know in advance which bits of memory should be
> > shared. Perhaps some client requests could be used which say "this
> > section of memory will be shared" or "this section of memory won't be
> > shared" could be useful. In the "won't be shared" sections the
> > checking might be a lot simpler?
>
> Hello Nick,
>
> Sounds nice, but a very important task for data race detection tools
> is to detect which data is shared between threads unintentionally.

That's one class of bug, but there is another class where data is
known to be shared but the locking is insufficient. Checking for the
latter but not the former doesn't sound useful on its own, but if it
gives a significant performance improvement it would be useful to have
as an addition.

I think the problem with it is that you would need to annotate *all*
libraries linked with the program for it to work, and that isn't going
to happen easily -- and would be extensive and hence error-prone if it
did.

Ashley,
|
|
From: Julian S. <js...@ac...> - 2008-02-20 11:45:35
|
On Wednesday 20 February 2008 08:15, Bart Van Assche wrote:
> On Feb 19, 2008 10:09 PM, Nicholas Nethercote <nj...@cs...> wrote:
> > But programmers should know in advance which bits of memory should be
> > shared. Perhaps some client requests could be used which say "this
> > section of memory will be shared" or "this section of memory won't be
> > shared" could be useful. In the "won't be shared" sections the
> > checking might be a lot simpler?
>
> Sounds nice, but a very important task for data race detection tools
> is to detect which data is shared between threads unintentionally.

For Helgrind-style schemes, data that is accessed only by one thread
stays in the exclusive state, and the state machine actions for
exclusive states are cheaper than for shared data, since there is no
need to do lockset intersections or threadset unions for Excl states.
So to some extent, there already is a less-expensive (I won't say
fast :-) handling case for data which is never shared.

But anyway. A flag which says "assume all stack accesses are
thread-local" would drastically cut the number of references to be
checked, and might be a useful addition. I think old Helgrind had such
a feature. The only real difficulty is deciding for sure what is and
isn't a stack access at JIT time (basically impossible; we'd need a
run-time filter too).

J
|
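One plausible shape for the run-time filter mentioned in Julian's message: check each data address against the current thread's stack range. The names and the source of the bounds are assumptions; a real tool would fetch them from its per-thread bookkeeping.

```c
#include <assert.h>

/* Sketch of a run-time stack filter: under the proposed flag, an
   access whose address falls inside the current thread's stack range
   is assumed thread-local and skipped.  The bounds are parameters
   here; a real tool would keep them per thread. */
static int assume_stack_is_local = 1;    /* the proposed flag */

static int must_check(unsigned long addr,
                      unsigned long stack_lo, unsigned long stack_hi)
{
   int on_stack = addr >= stack_lo && addr < stack_hi;
   return !(assume_stack_is_local && on_stack);
}
```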
|
From: Bart V. A. <bar...@gm...> - 2008-02-20 15:51:11
|
On Feb 20, 2008 12:42 PM, Julian Seward <js...@ac...> wrote:
> But anyway. A flag which says "assume all stack accesses are
> thread-local" would drastically cut the number of references to be
> checked, and might be a useful addition. I think old Helgrind had
> such a feature. The only real difficulty is deciding for sure what
> is and isn't a stack access at JIT time (basically impossible, we'd
> need a run-time filter too).

There are applications that share thread-allocated data over threads.
Some examples:

* At least with LinuxThreads, when creating a new thread by calling
pthread_create(), a pointer to the thread ID is passed as the first
argument to pthread_create(). Such thread IDs are typically allocated
on the stack. The newly created thread then fills in this thread ID,
and some time later the creator thread reads this thread ID when
calling pthread_join(). So this is an example of stack-allocated data
initialized by one thread and read by another thread. I did not yet
check the behavior of NPTL with regard to filling in thread IDs.

* There exist several high-level C++ abstractions of threads and
synchronization objects. These libraries typically associate one
object with each thread, one object with each mutex, etc. It is
convenient, when creating threads from inside main(), to allocate the
instances of thread objects on the stack. The data of such objects is
typically accessed both by the creator and by the created thread.

Bart.
|
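Bart's first example can be shown as a runnable fragment: the pthread_t lives on the creator's stack, is filled in during pthread_create() (by the library, or by the new thread under LinuxThreads), and is read back by the creator in pthread_join(). The function names here are mine.

```c
#include <pthread.h>
#include <stddef.h>

/* Stack-allocated data touched on behalf of more than one thread,
   even in this trivial program. */
static void* child(void* arg) { return arg; }

static int demo(void)
{
   pthread_t tid;              /* stack-allocated thread ID */
   int x = 42;
   void* ret = NULL;
   if (pthread_create(&tid, NULL, child, &x) != 0)  /* fills in 'tid' */
      return 0;
   pthread_join(tid, &ret);    /* creator reads 'tid' again */
   return ret == (void*)&x;
}
```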
|
From: Konstantin S. <kon...@gm...> - 2008-02-20 15:52:14
|
> Improved accuracy can only be a good thing. Is MSMProp1 accurate
> enough that you can find bugs in your real applications, without
> getting to many false errors now?

Yes! (I still see many false reports due to custom synchronization and
that needs code annotations, but that's another story...)

> > but noticeably slower
> > (mostly because it has to check happens-before in all states).
>
> Maybe some better caching of the happens-before queries would
> help? The current inplementation (hbefore__cache et al) uses a
> 64-entry fully associative cache, in effect. Maybe it would be
> better to have a larger, set-associative cache, eg 256 lines of
> 4 entries each, for example.

I don't see many cache misses here. But MSMProp1 has a loop where
MSMHelgrind did not have one. Anyway, this needs more experiments...

> For sure a lower miss rate on the cache is possible. Also, the
> miss path -- function cmpGEQ_VTS -- is naively coded, we could
> do a lot better there. Maybe it is possible to change the
> representation of VTSs so that comparison using vector operations
> (SSE insns, etc) is possible. Or at least so that the complex
> alignment logic can be avoided.

Right now I use the graph machinery, not VTS.
In order to support barriers I have to create a fake segment which
does not belong to any thread.
I did not figure out what VTS to put in that fake segment (well, I did
not try hard :))

--kcc
|
|
From: Julian S. <js...@ac...> - 2008-02-20 16:19:01
|
> Right now I use the graph machinery, not vts.
> In order to support barrier I have to create a fake segment which
> does not belong to any thread.
> I did not figure out what VTS to put in that fake segment (well, I
> did not try hard :))

Ah, ok, something to fix. The graph machinery is much too expensive
for realistic use. The cache miss rate may be low, but the miss cost
can be extremely high.

J
|
|
From: Nicholas N. <nj...@cs...> - 2008-02-20 21:08:40
|
On Wed, 20 Feb 2008, Julian Seward wrote:
>>> But programmers should know in advance which bits of memory should be
>>> shared. Perhaps some client requests could be used which say "this
>>> section of memory will be shared" or "this section of memory won't be
>>> shared" could be useful. In the "won't be shared" sections the checking
>>> might be a lot simpler?
>>
>> Sounds nice, but a very important task for data race detection tools
>> is to detect which data is shared between threads unintentionally.
>
> For Helgrind-style schemes, data that is accessed only by one thread
> stays in the exclusive state, and the state machine actions for exclusive
> states are cheaper than for shared data, since there is no need
> to do lockset intersections or threadset unions for Excl states.
> So to some extent, there already is a less-expensive (I won't say fast :-)
> handling case for data which is never shared.

I think you're both missing the point. If you say "this memory
shouldn't be shared", Helgrind could warn as soon as the memory leaves
the exclusive state. Currently, Helgrind won't warn about memory
shared unintentionally unless there's a race involved.

Judging from Julian's comment, this mightn't make much difference to
speed, but it would give stronger checking.

Nick
|
|
From: Julian S. <js...@ac...> - 2008-02-20 21:29:58
|
On Wednesday 20 February 2008 22:07, Nicholas Nethercote wrote:
> On Wed, 20 Feb 2008, Julian Seward wrote:
> >>> But programmers should know in advance which bits of memory should be
> >>> shared. Perhaps some client requests could be used which say "this
> >>> section of memory will be shared" or "this section of memory won't be
> >>> shared" could be useful. In the "won't be shared" sections the
> >>> checking might be a lot simpler?
> >>
> >> Sounds nice, but a very important task for data race detection tools
> >> is to detect which data is shared between threads unintentionally.
> >
> > For Helgrind-style schemes, data that is accessed only by one thread
> > stays in the exclusive state, and the state machine actions for exclusive
> > states are cheaper than for shared data, since there is no need
> > to do lockset intersections or threadset unions for Excl states.
> > So to some extent, there already is a less-expensive (I won't say fast
> > :-) handling case for data which is never shared.
>
> I think you're both missing the point. If you say "this memory shouldn't
> be shared", Helgrind could warn as soon as the memory leaves the exclusive
> state. Currently, Helgrind won't warn about memory shared unintentionally
> unless there's a race involved.
Ah. I did indeed miss the point.
Yes. So in effect it's a way for the program to communicate to the
checker, knowledge about properties of memory ("should always be
unshared") which it can check, and which it would not otherwise know.
Neat idea.
Typically only a small fraction of the memory in a program is shared;
most is unshared. It might be easier to turn it upside down, so the
tool is notified of areas which may be shared. Hmm. Not sure if
that is a good idea. It would give stronger checking but would
require exhaustively annotating all locations which might become
shared.
J
|
|
From: Nicholas N. <nj...@cs...> - 2008-02-21 06:02:02
|
On Wed, 20 Feb 2008, Julian Seward wrote:
> Typically only a small fraction of the memory in a program is shared;
> most is unshared.

Exactly. One of the nasty things about threads is that this fact
isn't recognised in the programming model -- every thread has full
access to every byte of memory that every other thread has access to;
there's no protection.

Nick
|
|
From: Bart V. A. <bar...@gm...> - 2008-02-21 10:31:37
|
On Wed, Feb 20, 2008 at 10:26 PM, Julian Seward <js...@ac...> wrote:
> Typically only a small fraction of the memory in a program is shared;
> most is unshared. It might be easier to turn it upside down, so the
> tool is notified of areas which may be shared. Hmm. Not sure if
> that is a good idea. It would give stronger checking but would
> require exhaustively annotating all locations which might become
> shared.

There exist applications where most data is shared between threads,
and only a small fraction is owned by only one thread. E.g. in
multithreaded finite element simulations the majority of the data
consists of element data, which is shared over all computation
threads.

Bart Van Assche.
|
|
From: Bart V. A. <bar...@gm...> - 2008-02-21 11:19:32
|
On Wed, Feb 20, 2008 at 10:26 PM, Julian Seward <js...@ac...> wrote:
> Typically only a small fraction of the memory in a program is shared;
> most is unshared. It might be easier to turn it upside down, so the
> tool is notified of areas which may be shared. Hmm. Not sure if
> that is a good idea. It would give stronger checking but would
> require exhaustively annotating all locations which might become
> shared.
In my previous e-mail I forgot to mention that exhaustively annotating
all shared locations would require modifying standard libraries.
Consider e.g. the std::string class defined in the ANSI C++ standard.
The language standard allows the standard library implementor to
choose between an implementation based on reference counting or an
implementation that copies the contents of a string in the copy
constructor. Consider now the following typical implementation of an
accessor function:
class PersonInformation {
public:
  std::string GetName()
  {
    /* lock mutex */
    std::string result = m_Name;
    /* unlock mutex */
    return result;
  }
private:
  std::string m_Name;
};
If the member function GetName() is called from another thread than
the thread in which m_Name was initialized, the data inside the m_Name
object will or will not be shared over threads. This depends on the
implementation of the std::string class (reference counting or
copying). In other words, any annotations with regard to sharing would
have to be added inside the std::string class implementation. And the
implementation of this class lives in a library header, which we do
not want to modify.
Bart.
|
|
From: Julian S. <js...@ac...> - 2008-02-20 22:36:43
|
> > For sure a lower miss rate on the cache is possible. Also, the
> > miss path -- function cmpGEQ_VTS -- is naively coded, we could
> > do a lot better there. Maybe it is possible to change the
> > representation of VTSs so that comparison using vector operations
> > (SSE insns, etc) is possible. Or at least so that the complex
> > alignment logic can be avoided.
>
> Right now I use the graph machinery, not vts.
> In order to support barrier I have to create a fake segment which does not
> belong to any thread.
Another way is with O(log N) new segments. No need for segments
that do not belong to any thread, and you can implement it using
the existing Segment representation easily.
Suppose there are 4 threads, t1, t2, t3, t4
* wait for all the threads to arrive at the barrier. When the
last one arrives:
* make a binary tree:
- "seg12": new segment belonging to t2, depending on t1 and old t2
- "seg34": new segment belonging to t4, depending on t3 and old t4
- "seg1234": new segment belonging to t4, depending on seg12 and seg34
* now seg1234 depends on all threads arriving at the barrier.
* now construct new segments for t1, t2, t3, t4, which of course
depend on the previous segments in the same thread, and also on
seg1234
I heard that this technique or something like it is explained in
Arndt Muehlenfeld's thesis, although I did not read it.
It is inefficient (in space) compared to having a single fake segment.
But at least you can implement it using the 2-input Segment nodes in
the original Helgrind. For a single fake segment, you would need
to represent N input dependencies, which makes the representation more
complex. Note that because the VTS effectively caches the result of
all hb-queries between all segment-pairs, this technique is not
inefficient in time, only in space (and then only O(log N) cost).
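The pairwise-combining scheme above can be sketched with 2-input segment nodes: 'prev' is the same-thread predecessor and 'other' is the cross-thread dependency. All names here are illustrative, not Helgrind's.

```c
#include <stdlib.h>

/* 2-input segment node, as in the original Helgrind representation. */
typedef struct Seg { struct Seg *prev, *other; } Seg;

static Seg* mk_seg(Seg* prev, Seg* other)
{
   Seg* s = malloc(sizeof(Seg));
   s->prev  = prev;
   s->other = other;
   return s;
}

/* Pairwise-combine the segments in which the n threads arrived at the
   barrier, until a single segment depends on all of them (seg1234 in
   the 4-thread example above).  'arrived' is overwritten in place. */
static Seg* barrier_root(Seg** arrived, int n)
{
   while (n > 1) {
      int m = 0;
      for (int i = 0; i + 1 < n; i += 2)
         arrived[m++] = mk_seg(arrived[i + 1], arrived[i]);
      if (n & 1)
         arrived[m++] = arrived[n - 1];   /* odd leftover passes up */
      n = m;
   }
   return arrived[0];
}

/* Does segment s transitively depend on t?  Only for checking the
   sketch; real happens-before queries would use the cached VTSs. */
static int depends_on(Seg* s, Seg* t)
{
   if (s == NULL) return 0;
   if (s == t)    return 1;
   return depends_on(s->prev, t) || depends_on(s->other, t);
}
```

After computing the root, each thread would start its post-barrier segment as mk_seg(its old segment, root), so every post-barrier segment depends on every pre-barrier one.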
> I did not figure out what VTS to put in that fake segment (well, I did not
> try hard :))
General rule:
supposing thread T is starting a new segment, giving segments Told and Tnew
and also we want to add dependence on some other segment S, then
Tnew->vts = tickL_and_joinR( Told->vts, S->vts )
assuming that Tnew->prev == Told and Tnew->other == S
and it is important that Told and Tnew are from the same thread.
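Read literally, the rule could be implemented like this over fixed-size vector timestamps indexed by thread number. This is a simplification (Helgrind's real VTSs are variable-length), and tick-before-join is one plausible reading of the order.

```c
/* Sketch of the tickL_and_joinR rule above: fixed-size vector
   timestamps, one clock per thread, as a simplification of the real
   representation. */
#define N_THREADS 4
typedef struct { unsigned clk[N_THREADS]; } VTS;

/* Tnew->vts for thread t: tick t's own component of the left argument
   (Told), then join in the right argument (S) component-wise. */
static VTS tickL_and_joinR(int t, const VTS* told, const VTS* s)
{
   VTS tnew = *told;
   tnew.clk[t]++;                          /* tick the left argument */
   for (int i = 0; i < N_THREADS; i++)     /* join the right argument */
      if (s->clk[i] > tnew.clk[i])
         tnew.clk[i] = s->clk[i];
   return tnew;
}
```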
J
|
|
From: Konstantin S. <kon...@gm...> - 2008-02-21 11:44:46
|
> Another way is with O(log N) new segments. No need for segments
> that do not belong to any thread, and you can implement it using
> the existing Segment representation easily....

Emm, not sure about this.
What I did is not just to implement barriers, but to implement any
graph structure where one segment may have an arbitrary number of
segments that happen-before it.
A more general example is when we have N1 threads signaling on one CV
and N2 threads waiting on this CV.

> I heard that this technique or something like it is explained in
> Arndt Muehlenfeld's thesis, although I did not read it.

I did exactly what is described there. See
http://code.google.com/p/data-race-test/wiki/BarrierSupport.

> It is inefficient (in space) compared to having a single fake segment.

For a barrier with N threads I create N fake segments. So it's 2x less
memory-efficient than in current helgrind.
|