|
From: Philippe W. <phi...@sk...> - 2012-02-13 21:45:40
Attachments:
patch_mt_hack.txt
|
I quickly modified Valgrind to make it multi-threaded.
The approach:
1. is only done with the "pipe" locking
2. is using in total 3 "pipe semaphores" to implement a reader/writer
   locking scheme. (using atomic increment/decrement would allow
   to have only two semaphores).
3. only the none tool "works" (?) (see below for the definition of "works").
4. is a real hack. ("calling this a hack is an insult to other hacks").
   Among others:
   - I have disabled some assertions related to the dispatch_cntr
     (which is a global variable now and should be per thread).
   - the concept of "the" running tid being broken, several other
     asserts were blindly commented out
   - no attempt at efficiency. E.g. upgrading a read lock to
     a write lock is done by releasing the read lock and obtaining
     a write lock. E.g. if there is a mixture of threads blocked
     in a syscall and burning CPU, the scheduling is slower
   - it is for sure missing plenty of 'multi-thread related' asserts
   - zillions of things are probably not working properly,
     e.g. debug printing, ...
   - ....
In any case, here are some results:
time ./vg-in-place --tool=none ./gdbserver_tests/parallel_sleepers 50000 0 50000 B-B-B-B-
==5365== Nulgrind, the minimal Valgrind tool
....
real 0m9.920s
user 0m38.038s
sys 0m0.188s
time ../trunk_untouched/vg-in-place --tool=none
./gdbserver_tests/parallel_sleepers 50000 0 50000 B-B-B-B-
==5374== Nulgrind, the minimal Valgrind tool
....
real 0m38.569s
user 0m37.410s
sys 0m0.180s
So, we see that parallel_sleepers (which is in fact 4 threads burning CPU)
takes about 38.5 seconds real and 37.4 seconds CPU with the trunk, and
about 9.9 seconds real and 38 seconds CPU with the hacked Valgrind.
Of course, only the none tool is done :).
And of course, it is just an unvalidated hack done in 3 hours.
(185 tests failing, but only 22 none tests failing :)
But if we can make the tools somewhat multi-threaded and the rest of
coregrind properly protected, it looks promising ....
I am attaching the patch just for information/to start a discussion.
It is for sure totally out of the question to commit this.
Philippe
|
|
From: Christian B. <bor...@de...> - 2012-02-14 08:05:33
|
On 13/02/12 22:45, Philippe Waroquiers wrote:
> I quickly modified Valgrind to make it multi-threaded.
>
> The approach:
> 1. is only done with the "pipe" locking
> 2. is using in total 3 "pipe semaphores" to implement a reader/writer
>    locking scheme. (using atomic increment/decrement would allow
>    to have only two semaphores).
> 3. only the none tool "works" (?) (see below for the definition of "works").
> 4. is a real hack. ("calling this a hack is an insult to other hacks").
>    Among others:
>    - I have disabled some assertions related to the dispatch_cntr
>      (which is a global variable now and should be per thread).
>    - the concept of "the" running tid being broken, several other
>      asserts were blindly commented out
>    - no attempt at efficiency. E.g. upgrading a read lock to
>      a write lock is done by releasing the read lock and obtaining
>      a write lock. E.g. if there is a mixture of threads blocked
>      in a syscall and burning CPU, the scheduling is slower
>    - it is for sure missing plenty of 'multi-thread related' asserts
>    - zillions of things are probably not working properly,
>      e.g. debug printing, ...
>    - ....
Nice performance results. I think we will have to go down that path at
some point.
VEX is currently not thread safe
(eg. guest_EIP_curr_instr in x86 or guest_IA_curr_instr in s390)
and we probably have to start over there.
Christian
|
|
From: Julian S. <js...@ac...> - 2012-02-14 09:18:47
|
On Tuesday, February 14, 2012, Christian Borntraeger wrote:
> On 13/02/12 22:45, Philippe Waroquiers wrote:
> > I quickly modified Valgrind to make it multi-threaded.
> > [...]
>
> Nice performance results. I think we will have to go down that path at
> some point.
Yes. Excellent stuff.
> VEX is currently not thread safe
> (eg. guest_EIP_curr_instr in x86 or guest_IA_curr_instr in s390)
> and we probably have to start over there.
There are probably hundreds of places in the system where global state
is used, that will need to be found and fixed.
I think a small investment of effort that will pay off in a big way
with this work is to annotate the pipe locking code with
ANNOTATE_HAPPENS_{BEFORE,AFTER} macros, and possibly
VALGRIND_HG_CLEAN_MEMORY, so that we can race-check the system with
DRD and Helgrind. That is far more effective and stress-free than
trying to find all the racey places "by hand".
J
|
|
From: Bart V. A. <bva...@ac...> - 2012-02-15 07:22:41
|
On Wed, Feb 15, 2012 at 12:54 AM, Philippe Waroquiers <phi...@sk...> wrote:
> So, we might need to have a bunch of readers/writer locking
> (for some data structures) and some fine grained locking for
> smaller data structures (e.g. the memcheck free list, the malloc data
> structures, ...).
> With the current implementation, we need 3 pipes for each readers/writer
> lock (2 if we use atomic instructions) + we will need one pipe for each
> "fine grained" lock. => it looks like a more lightweight locking
> mechanism might be needed (futex based maybe ? but what for Darwin ?
> maybe use normal pthread mutexes on Darwin ? )

It looks like the right time to me to add locking primitives in the
Valgrind core (e.g. mutual exclusion + reader-writer lock).

The Darwin source code is available online under a BSD license. You can
have a look at it to see how locking primitives have been implemented
in their C library.

Note: glibc mutexes do a little more than invoking the futex syscalls
directly - if I remember correctly the pthread_mutex_lock()
implementation first spins for a short time in order to minimize the
number of system calls invoked.

Bart.
|
From: Tom H. <to...@co...> - 2012-02-15 08:28:44
|
On 15/02/12 07:22, Bart Van Assche wrote:
> Note: glibc mutexes do a little more than invoking the futex syscalls
> directly - if I remember correctly the pthread_mutex_lock()
> implementation first spins for a short time in order to minimize the
> number of system calls invoked.

That's the whole point of the futex - you only have to make a system
call and enter the kernel when the lock is contended.

I don't think it actually spins in userspace, it's just that it is able
to check (and if available take) the lock from userspace and only enter
the kernel if it needs to block.

Wikipedia has a summary and links to various papers:

http://en.wikipedia.org/wiki/Futex

Tom

--
Tom Hughes (to...@co...)
http://compton.nu/
|
From: Julian S. <js...@ac...> - 2012-02-15 08:59:48
|
On Wednesday, February 15, 2012, Tom Hughes wrote:
> On 15/02/12 07:22, Bart Van Assche wrote:
> > Note: glibc mutexes do a little more than invoking the futex syscalls
> > directly - if I remember correctly the pthread_mutex_lock()
> > implementation first spins for a short time in order to minimize the
> > number of system calls invoked.
>
> That's the whole point of the futex - you only have to make a system
> call and enter the kernel when the lock is contended.
>
> I don't think it actually spins in userspace, it's just that it is able
> to check (and if available take) the lock from userspace and only enter
> the kernel if it needs to block.

For sure we will need to have our own library of locking primitives.
The non-contended case, using atomic memory operations, is easy enough
(for some definition of easy: Drepper's "Futexes Are Tricky" shows how
hairy even that part can be).

The tricky bit IMO is going to be the contended case. On Linux we can
fall back to futexes in the normal way, but on Darwin I imagine the
kernel-level helpers for contended cases are completely different. And
may even be different in different OSX versions. I'm sure I saw a bunch
more threading-sounding syscalls for 10.7 that weren't present in 10.6.

J
|
From: Bart V. A. <bva...@ac...> - 2012-02-15 09:51:23
|
On Wed, Feb 15, 2012 at 9:28 AM, Tom Hughes <to...@co...> wrote:
> I don't think it actually spins in userspace, it's just that it is able to
> check (and if available take) the lock from userspace and only enter the
> kernel if it needs to block.

Not that it's important, but as far as I can see the
pthread_mutex_lock() implementation does spin in userspace for mutexes
of type PTHREAD_MUTEX_ADAPTIVE_NP. See e.g.
http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c.

Bart.
|
From: Philippe W. <phi...@sk...> - 2012-02-14 23:54:15
|
On Tue, 2012-02-14 at 10:11 +0100, Julian Seward wrote:
> > VEX is currently not thread safe
> > (eg. guest_EIP_curr_instr in x86 or guest_IA_curr_instr in s390)
> > and we probably have to start over there.
>
> There are probably hundreds of places in the system where global state
> is used, that will need to be found and fixed.
>
> I think a small investment of effort that will pay off in a big way
> with this work is to annotate the pipe locking code with
> ANNOTATE_HAPPENS_{BEFORE,AFTER} macros, and possibly
> VALGRIND_HG_CLEAN_MEMORY, so that we can race-check the system with
> DRD and Helgrind. That is far more effective and stress-free than
> trying to find all the racey places "by hand".
Indeed, using helgrind/drd to find the race conditions is a
very attractive thing to do (also because "eat your own dog food" :).
Note however that I am not at all sure that the (hacked) prototype
is the correct way to go. Or at least, there is still a whole bunch
of unclear points to me.
For example: we have two threads which are running user code.
(so, each thread has a "read lock" via the scheduler locking,
aimed at protecting a.o. the translation table).
Both threads are doing a "free" in parallel (memcheck tool) and so
will each need to modify the list of recently freed blocks.
With the current 'readers/writer' only lock, it is not clear to me
that this will work : if these threads are releasing their read locks
to acquire a write lock, then it might be that this causes a problem
(e.g. the currently executed translated code might be evicted from
the cache). If they try to upgrade the read lock to a write lock,
the two threads will deadlock.
So, we might need to have a bunch of readers/writer locking
(for some data structures) and some fine grained locking for
smaller data structures (e.g. the memcheck free list, the malloc data
structures, ...).
With the current implementation, we need 3 pipes for each readers/writer
lock (2 if we use atomic instructions) + we will need one pipe for each
"fine grained" lock. => it looks like a more lightweight locking
mechanism might be needed (futex based maybe ? but what for Darwin ?
maybe use normal pthread mutexes on Darwin ? )
Whatever is easy to make multi-threaded (maybe VEX ?) would
preferably be done that way (rather than adding locking).
At this stage, how to attack making Valgrind multi-threaded is not yet
clear to me. I was hoping the prototype would help to understand the
problem and/or start the discussion (at least the 2nd point is a
success :).
Philippe
|
|
From: Julian S. <js...@ac...> - 2012-02-15 09:35:42
|
> Note however that I am not at all sure that the (hacked) prototype
> is the correct way to go.

Me neither; but OTOH we have to start to learn about the problem
somehow, and this seems as good a way as any. Even if it is nothing
like how the final implementation will look.

> Or at least, there is still a whole bunch
> of unclear points to me.

Me too :-)

One conceptual distinction which I think will be useful (or, really,
essential) is to distinguish two different problems: (1) making the
core multithreaded (that is, having it run with --tool=none reliably)
and (2) making the tools multithreaded.

(1) is a lot easier than (2), and I would suggest we aim to achieve it
as a first step. Speaking at least for Memcheck and Helgrind, both
tools currently rely heavily on the single-threadedness. I have
studied Memcheck's design in quite some detail w.r.t. threading, and
many of the space-saving optimisations in it will be invalidated by
threading. In short it will require a complete reimplementation of the
shadow memory stuff (mc_main.c, basically).

Assuming we concentrate on (1) for now .. (which for sure is by itself
more than enough complexity ..)

> For example: we have two threads which are running user code.
> (so, each thread has a "read lock" via the scheduler locking,
> aimed at protecting a.o. the translation table).

Urr, so just to concentrate on the basic execution mechanism ..

If we have all threads executing translations in parallel and assume
there is no problem with locking .. fine. But suppose one thread
leaves generated code because a translation needs to be made. It's
probably doable to make VEX run MT to do the translation, but when we
come to add it to the translation table and tt_fast cache .. do we
need locking there?

And if a thread leaves generated code in order to remove a translation
for whatever reason .. how do we do that without crashing the other
running threads?

J
|
From: Philippe W. <phi...@sk...> - 2012-02-15 20:25:03
|
On Wed, 2012-02-15 at 10:28 +0100, Julian Seward wrote:
> Assuming we concentrate on (1) for now .. (which for sure is
> by itself more than enough complexity ..)
Indeed, a complex/nice objective by itself. OTOH, we might need
to look at what the core needs to provide to allow a tool to be
multi-threaded and understand the interactions between core and tool
in terms of locking.
>
> > For example: we have two threads which are running user code.
> > (so, each thread has a "read lock" via the scheduler locking,
> > aimed at protecting a.o. the translation table).
>
> Urr, so just to concentrate on the basic execution mechanism ..
>
> If we have all threads executing translations in parallel and
> assume there is no problem with locking .. fine. But suppose
> one thread leaves generated code because a translation needs to
> be made. It's probably doable to make VEX run MT to do the
> translation, but when we come to add it to the translation
> table and tt_fast cache .. do we need locking there?
Currently, the reader/writer lock protects the call to VG_(translate)
(and so everything behind it, i.e. VEX, the translation table, ...).
I understand that the code translation is not a big part of the
cost, so using locking here might be the best approach.
Not too clear what would be the alternatives:
* a local translation table per thread ?
=> (what about memory ?)
* a lockless data structure ? (e.g. lockless hash table ?)
=> what about performance of the search ?
=> and we still have to avoid the below problem, i.e.
make a translation disappear when a thread is using it.
>
> And if a thread leaves generated code in order to remove
> a translation for whatever reason .. how do we do that without
> crashing the other running threads?
So, the proto reader/writer locking for this might be the good approach.
Philippe
|
|
From: Greg P. <gp...@ap...> - 2012-02-15 17:31:12
|
On Feb 15, 2012, at 12:52 AM, Julian Seward <js...@ac...> wrote:
> The tricky bit IMO is going to be the contended case. On Linux we
> can fall back to futexes in the normal way, but on Darwin I imagine
> the kernel-level helpers for contended cases are completely different.
> And may even be different in different OSX versions. I'm sure I saw
> a bunch more threading-sounding syscalls for 10.7 that weren't
> present in 10.6.

If you're rolling your own synchronization code then the Mach semaphore
syscalls ought to be straightforward and as consistent as anything else
valgrind does. The pthread syscalls change frequently across OS
versions.

--
Greg Parker    gp...@ap...    Runtime Wrangler
|
From: Bart V. A. <bva...@ac...> - 2012-02-16 07:21:31
|
On Wed, Feb 15, 2012 at 10:28 AM, Julian Seward <js...@ac...> wrote:
> And if a thread leaves generated code in order to remove
> a translation for whatever reason .. how do we do that without
> crashing the other running threads?

Maybe we should report this to the user as a (data race) error instead
of trying to address this now.

Bart.
|
From: Dave G. <go...@mc...> - 2012-02-16 17:26:50
|
On Feb 16, 2012, at 1:21 AM CST, Bart Van Assche wrote:
> On Wed, Feb 15, 2012 at 10:28 AM, Julian Seward <js...@ac...> wrote:
>> And if a thread leaves generated code in order to remove
>> a translation for whatever reason .. how do we do that without
>> crashing the other running threads?
>
> Maybe we should report this to the user as a (data race) error instead
> of trying to address this now.

What about using Userspace RCU to protect access to the translation
table? Wouldn't this allow you to ensure that the remover won't clobber
any threads currently using that translation while keeping the
fast-path fast? Or is this too complex for the current stage of design?

-Dave