|
From: Jeremy F. <je...@go...> - 2005-01-07 05:06:31
|
Over the hols I decided it would be fun to rewrite Valgrind's
syscall/threading stuff again. The basic idea came from Eric Estievenart
in a discussion earlier this year.
Rather than having the "main thread" with a bunch of proxy LWP threads
to deal with signals and so on, every app-level thread gets its own
kernel-level thread, so that the thread/process structure of the program
is identical inside and outside Valgrind. Thread execution is still
serialized, so that only one client thread can run at once.
In addition to this change, I killed Valgrind's pthread library.
Instead, Valgrind supports the clone syscall, plus all the other
threading-related syscalls (futex, set_thread_area, set_tid_address,
gettid, tkill, tgkill, etc). This allows the system threads library to
work under Valgrind. I don't support fully general clone, but it does
work with both NPTL and LinuxThreads, as well as with using clone() to
do fork and vfork (vfork is mapped to fork, as before).
The chief advantage of these changes is simplicity. Valgrind now
implements a much thinner layer over the kernel, and doesn't try to
second-guess the kernel's behaviour. This means, for example, that
there's no more "mode switch" for 2.4- and 2.6-style kernels - they both
work equally well, despite having significant differences in thread
support and signal handling. vg_libpthread had been becoming
increasingly troublesome, and was basically completely broken (for
example, when using a multithreaded program with a Tool which doesn't
replace malloc, there was no locking to protect the libc malloc from
concurrent access). Any system thread library should work now, including
home-rolled ones (with a bit of tweaking).
All this has resulted in Valgrind shrinking by about 6600 lines of code,
including all of vg_proxylwp.c, vg_libpthread.c and about 2200 lines
from vg_scheduler.c - some of the hairiest, trickiest and most
error-prone code in the codebase.
There are some disadvantages, though. In losing vg_libpthread, Valgrind
has a much more limited understanding of what's going on inside a
threaded program. It can see threads starting and exiting, but that's
about it. It can't tell when a mutex has been locked or unlocked; it
can't even really tell which threads are waiting for another to exit.
This means that all the core pthreads error reporting has gone, and
helgrind is basically useless. We can probably glean a bit more
information than we currently do, but not a lot. Probably the best way
to fix this is to work with the glibc people to try to build some
Valgrind support into glibc, and/or take advantage of the existing hooks
which gdb uses. I haven't investigated this at all yet.
The patch as it currently stands works well for me. I've mostly tested
it on my FC2/2.6.10 system, using both the LinuxThreads and NPTL threads
libraries. The regression tests mostly pass, and I can run OOo and
firefox under nulgrind and memcheck.
Other changes: I took advantage of the general upheaval to try to
isolate more of the OS/arch/platform-dependent code. vg_scheduler is now
(almost?) completely free of OS-specific knowledge: it just assumes
there are things called threads, which can be running or blocked,
interrupted by async events, and eventually exit, but all the details
are left up to other OS-specific code. Also, one widespread change was
renaming get_current_tid to get_VCPU_tid. The old name was too ambiguous
in the context of the new threading, and the new name better describes
what it actually does.
So, the patch is at
http://www.goop.org/~jeremy/valgrind/new-threading.patch.bz2
If people could try it out and see if it basically works for them, I'll
check it in the next few days. In particular, I'd like it if the people
considering porting Valgrind to other OS's could have a look and see how
these changes affect them.
TODO:
* Examine all the regression test failures, and work out what's going
  on. Some are trivial (thread IDs have changed), some are irrelevant
  (pthreads error checking) and some are strange (particularly the
  programs which work when run standalone, but fail when run in the
  harness).
* Examine all the new errors popping out of libpthread, and work out
  what's going on. Some need suppressing, but some might be real bugs
  (in either Valgrind or libpthread).
* Test on more machines. I did a brief test on an RH7.3/2.4 machine,
  and it nearly worked. It seems that its libpthread does something
  interesting with segment registers which isn't supported by Valgrind.
  I think it wants LDT entries shared between threads. It should be
  possible to get 2.4 (or even 2.2 if we wanted to) systems supported
  without too much programming effort.
* Try to recover the lost pthreads functionality. All that pthreads
  API usage checking was useful, and it uncovered a lot of bugs. We
  should do as much as we can to work out what the client is doing from
  the available evidence, and also look into getting help from the
  threads library. The tricky part will be making sure this doesn't end
  up being another track-the-version-of-the-day maintenance headache.
* I implemented the thread-serializing run_sema with both a futex and
  a pipe-based token-passing scheme. The futex code is more efficient,
  but the pipe scheme works everywhere, so that's what is enabled by
  default. Switching is a compile-time option, but it could be done at
  runtime.
WOULD LIKE TO DO (comments?):
* I noticed that ThreadState * and ThreadId are synonyms, and there's
  a lot of redundant code for converting between the two. I would like
  to drop ThreadId, and just use ThreadState * everywhere. Outside the
  core, it would just be an opaque pointer.
* As a corollary, I would also like to get rid of the VG_(threads)
  array and dynamically allocate ThreadState structures, removing the
  hard upper limit of VG_N_THREADS. The only significant bad effect
  this might have is on Tools which maintain their own arrays for
  storing per-thread information (like helgrind). We can just change
  them to add a per-thread-info-for-Tools field to ThreadState.
J
|
|
From: Jeremy F. <je...@go...> - 2005-01-07 07:59:41
|
On Thu, 2005-01-06 at 21:06 -0800, Jeremy Fitzhardinge wrote:
> The basic idea came from Eric Estievenart in a discussion earlier this
> year.
Erm, last year. I should have proofread the draft.
J
|
|
From: Julian S. <js...@ac...> - 2005-01-09 19:14:22
|
> Over the hols I decided it would be fun to rewrite Valgrind's
> syscall/threading stuff again.
>
> The basic idea came from Eric Estievenart [...]
That's great! I'm very pleased to see this. Especially losing 6600
loc sounds good to me.
> If people could try it out and see if it basically works for them, I'll
> check it in the next few days. In particular, I'd like it if the people
> considering porting Valgrind to other OS's could have a look and see how
> these changes affect them.
Before this goes in, I'd really like that
(1) support for 2.4, and particularly 2.4 on older systems (eg, your
2.4 on RH73) gets more beaten on and tested, since in reality we will
have to give good support for 2.4 for at least another two years.
(2) more regression test cases are fixed.
Are there any performance consequences that you are aware of? I had
some concern that programs which cycle rapidly between running threads
would be slower now, because each thread switch involves a trip into
the kernel where it didn't before.
> * Try to recover the lost pthreads functionality. All that
> pthreads API usage checking was useful, and it uncovered a lot
> of bugs.
Yes. That would be nice but is not critical. Perhaps a useful thing
to do at this point, assuming no nice, easy solution is available,
is to come up with a proposal which can be considered at length.
> * I implemented the thread-serializing run_sema with both a futex
> and a pipe-based token-passing scheme. The futex code is more
> efficient, but the pipe scheme works everywhere, so that's what
> is enabled by default. Switching is a compile-time option, but
> it could be done at runtime.
Do you have any numbers indicating the relative expenses of the two
schemes?
J
|
|
From: Jeremy F. <je...@go...> - 2005-01-09 23:56:58
|
On Sun, 2005-01-09 at 19:14 +0000, Julian Seward wrote:
> > If people could try it out and see if it basically works for them, I'll
> > check it in the next few days. In particular, I'd like it if the people
> > considering porting Valgrind to other OS's could have a look and see how
> > these changes affect them.
>
> Before this goes in, I'd really like that
>
> (1) support for 2.4, and particularly 2.4 on older systems (eg, your
> 2.4 on RH73) gets more beaten on and tested, since in reality we will
> have to give good support for 2.4 for at least another two years.
I can try FC2 with 2.4 kernel, but I expect it would be the same as
using LD_ASSUME_KERNEL=2.4.0 or removing /lib/tls. I don't have any old
distros lying about. I figured we could fix the old stuff once the rest
is checked in (particularly since the RH7.3 problem seems to be in Tom's
bit, hint, hint).
2.4 support will be a matter of fixing little things, since there's no
major code-path differences for 2.4 vs 2.6.
> Are there any performance consequences that you are aware of? I had
> some concern that programs which cycle rapidly between running threads
> would be slower now, because each thread switch involves a trip into
> the kernel where it didn't before.
Not noticeable. Besides, the client running natively would be doing
kernel context switches anyway, so I don't see that it makes any real
difference.
In fact, I'd expect a bit of an improvement, since the old code used
many syscalls per client syscall, but now it's at most 4.
> > * I implemented the thread-serializing run_sema with both a futex
> > and a pipe-based token-passing scheme. The futex code is more
> > efficient, but the pipe scheme works everywhere, so that's what
> > is enabled by default. Switching is a compile-time option, but
> > it could be done at runtime.
>
> Do you have any numbers indicating the relative expenses of the two
> schemes?
Not measurable in single-threaded programs, which is where it would make
the most difference (2 syscalls/context switch vs. 0). I'll stick with
the pipe code.
J
|
|
From: Tom H. <th...@cy...> - 2005-01-10 09:45:59
|
In message <1105074377.32288.52.camel@localhost>
Jeremy Fitzhardinge <je...@go...> wrote:
> So, the patch is at
> http://www.goop.org/~jeremy/valgrind/new-threading.patch.bz2
>
> If people could try it out and see if it basically works for them, I'll
> check it in the next few days. In particular, I'd like it if the people
> considering porting Valgrind to other OS's could have a look and see how
> these changes affect them.
Right. I've tried it on a range of systems now. Initial comments
are that it all looks pretty nice but I have a few problems...
- First up, you have used AS_HELP_STRING in configure.in but
none of my older systems know that even though they have
newer autoconfs than they originally shipped with. Only a
problem for people using CVS of course.
- Second, on RH9 the clone used to start threads is not trapped
because CLONE_SYSVSEM is not set. I just added an extra case
for that to the switch.
The big one is that something is causing it to hang. Different systems
hang in different places but all my systems hang at some point during
the regression tests:
- RH72 and RH73 hang in corecheck/tests/pth_atfork1
- RH80 hangs in corecheck/tests/pth_cancel1
- RH9 and FC3 hang in none/tests/x86/yield
I haven't tried to work out what is causing any of the hangs yet.
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|
|
From: Tom H. <th...@cy...> - 2005-01-10 09:59:35
|
In message <yek...@au...>
Tom Hughes <th...@cy...> wrote:
> - RH9 and FC3 hang in none/tests/x86/yield
This is a bug in the test - it does a signal on the CV when it
actually wants to wake both threads, so it should do a broadcast. It
seems that we used to wake both threads anyway ;-)
The test still fails because the yield thread yields many more
times than the one that spin-waits.
I think the mutex and CV should also be initialised using the
proper static initialisers for the code to be truly correct.
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|
|
From: Jeremy F. <je...@go...> - 2005-01-10 16:34:19
|
On Mon, 2005-01-10 at 09:45 +0000, Tom Hughes wrote:
> Right. I've tried it on a range of systems now. Initial comments
> are that it all looks pretty nice but I have a few problems...
Download a new copy of the patch; I've fixed some things since the first
announcement.
> - First up, you have used AS_HELP_STRING in configure.in but
>   none of my older systems know that even though they have
>   newer autoconfs than they originally shipped with. Only a
>   problem for people using CVS of course.
That seems to be a general autoconf problem. I had to install a new
autoconf on my RH7.3 system to get the CVS HEAD to build anyway.
> - Second, on RH9 the clone used to start threads is not trapped
>   because CLONE_SYSVSEM is not set. I just added an extra case
>   for that to the switch.
I changed the switch to only look at the flags we really care about.
> The big one is that something is causing it to hang. Different systems
> hang in different places but all my systems hang at some point during
> the regression tests:
>
> - RH72 and RH73 hang in corecheck/tests/pth_atfork1
>
> - RH80 hangs in corecheck/tests/pth_cancel1
>
> - RH9 and FC3 hang in none/tests/x86/yield
The yield one is a bug in the test. It should be pthread_cond_broadcast,
not pthread_cond_signal. But the test is bogus now anyway, since it
asserts that "rep;nop" has the same scheduling effect as "sched_yield",
which definitely isn't true now.
J
|
|
From: Nicholas N. <nj...@ca...> - 2005-01-10 20:48:01
|
On Mon, 10 Jan 2005, Jeremy Fitzhardinge wrote:
>> - First up, you have used AS_HELP_STRING in configure.in but
>>   none of my older systems know that even though they have
>>   newer autoconfs than they originally shipped with. Only a
>>   problem for people using CVS of course.
>
> That seems to be a general autoconf problem. I had to install a new
> autoconf on my RH7.3 system to get the CVS HEAD to build anyway.
I've seen that problem before too -- IIRC, I handled it by just writing
the string directly instead of using AS_HELP_STRING; it may not be the
"proper" way, but it works, and it's only an error msg in the autoconf
output that is slightly affected.
N
|
|
From: Tom H. <th...@cy...> - 2005-01-10 10:17:42
|
In message <yek...@au...>
Tom Hughes <th...@cy...> wrote:
> - RH72 and RH73 hang in corecheck/tests/pth_atfork1
>
> - RH80 hangs in corecheck/tests/pth_cancel1
The RH80 hang is LDT related, and I suspect the other one is the
same but I haven't checked it. I assume this is the problem that
you were talking about Jeremy?
The problem is that the LDT needs to be copied from the parent
thread when a child is created. That is done by VGA_(setup_child)
but that is no longer being called.
If you change do_clone() and add this line just before the
code to handle SETTLS then it should work:
VGA_(setup_child)(&ctst->arch, &ptst->arch);
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|
|
From: Nicholas N. <nj...@ca...> - 2005-01-10 13:33:23
|
On Thu, 6 Jan 2005, Jeremy Fitzhardinge wrote:
> All this has resulted in Valgrind shrinking by about 6600 lines of code,
> including all of vg_proxylwp.c, vg_libpthread.c and about 2200 lines
> from vg_scheduler.c - some of the hairiest, trickiest and most
> error-prone code in the codebase.
Woo!
> There are some disadvantages though. In losing vg_libpthread, Valgrind
> has a much more limited understanding of what's going on inside a
> threaded program. It can see threads starting and exiting, but that's
> ...
> Probably the best way to fix this is to work with the glibc people to
> try to build some Valgrind support into glibc, and/or take advantage of
> the existing hooks which gdb uses. I haven't investigated this at all
> yet.
Relying on glibc doesn't sound so good to me -- it assumes the program
uses glibc rather than some other threads library, and will also not be
any good for existing versions of glibc.
Feels to me like the right way to do this is to get function wrapping
working somehow, ie. allowing Valgrind tools to augment functions like
pthread_mutex_lock() with some extra code (this is different to function
replacement, which we currently use extensively; we don't want to have
to duplicate the functionality of the original function). I can't
remember if this has been discussed previously (probably) and if so,
what the general opinion was about how easy/reliable it is.
> Other changes: I took advantage of the general upheaval to try to
> isolate more of the OS/arch/platform dependent code.
Sounds good. Hopefully it will hold up to an OS-porting effort.
I second Julian in saying that it would be nice to see a bit more
polishing before it goes in. Looking at the patch, there are a few parts
that say "XXX" or "must fix this later"... it would be good to fix
these, otherwise they're likely to stay that way for a while. And the
things Tom has found, obviously.
Also, there's the question of how this will interact with Julian's
private VCPU changes. It could be a tricky merge.
I think it's great you've got this working! So much nasty code removed,
it's great :) I'd love to know how many of the open bugs will eventually
be able to be closed by these changes... and self-hosting will be
awesome too. Great stuff!
N
|
|
From: Jeremy F. <je...@go...> - 2005-01-10 16:44:18
|
On Mon, 2005-01-10 at 13:33 +0000, Nicholas Nethercote wrote:
> Feels to me like the right way to do this is to get function wrapping
> working somehow, ie. allowing Valgrind tools to augment functions like
> pthread_mutex_lock() with some extra code (this is different to function
> replacement, which we currently use extensively; we don't want to have
> to duplicate the functionality of the original function). I can't
> remember if this has been discussed previously (probably) and if so, what
> the general opinion was about how easy/reliable it is.
Well, to be really useful, we need to determine some internal state of
the pthreads library. With pthread_mutex_lock, we want to know when it's
about to block trying to take a lock, as well as when it has actually
taken the lock. The first one is tricky, and not easily derivable from
plain wrapping (unless we run a parallel simulation of pthreads to model
what each particular call is going to do, which sounds fragile).
> I second Julian in saying that it would be nice to see a bit more
> polishing before it goes in. Looking at the patch, there are a few parts
> that say "XXX" or "must fix this later"... it would be good to fix these,
> otherwise they're likely to stay that way for a while.
I'll review them.
> And the things Tom
> has found, obviously.
Yep.
> Also, there's the question of how this will interact with Julian's private
> VCPU changes. It could be a tricky merge.
I made almost no changes to vg_to/from_ucode, so I'm assuming that
there's not much interaction.
J
|
|
From: Jeremy F. <je...@go...> - 2005-01-10 16:45:57
|
On Mon, 2005-01-10 at 10:17 +0000, Tom Hughes wrote:
> In message <yek...@au...>
> Tom Hughes <th...@cy...> wrote:
>
> > - RH72 and RH73 hang in corecheck/tests/pth_atfork1
> >
> > - RH80 hangs in corecheck/tests/pth_cancel1
>
> The RH80 hang is LDT related, and I suspect the other one is the
> same but I haven't checked it. I assume this is the problem that
> you were talking about Jeremy?
>
> The problem is that the LDT needs to be copied from the parent
> thread when a child is created. That is done by VGA_(setup_child)
> but that is no longer being called.
>
> If you change do_clone() and add this line just before the
> code to handle SETTLS then it should work:
>
> VGA_(setup_child)(&ctst->arch, &ptst->arch);
Oh, good. For complete correctness, shouldn't they be shared? If one
thread later uses an LDT-changing call, should that be reflected in the
other threads?
J
|
|
From: Tom H. <th...@cy...> - 2005-01-10 17:00:47
|
In message <1105375498.29068.48.camel@localhost.localdomain>
Jeremy Fitzhardinge <je...@go...> wrote:
> On Mon, 2005-01-10 at 10:17 +0000, Tom Hughes wrote:
>
>> The problem is that the LDT needs to be copied from the parent
>> thread when a child is created. That is done by VGA_(setup_child)
>> but that is no longer being called.
>>
>> If you change do_clone() and add this line just before the
>> code to handle SETTLS then it should work:
>>
>> VGA_(setup_child)(&ctst->arch, &ptst->arch);
>
> Oh, good. For complete correctness, shouldn't they be shared? If one
> thread later uses an LDT-changing call, should that be reflected in the
> other threads?
I'm not sure. I don't think that's my code - the TLS work is mine
but the LDT stuff for linuxthreads already existed.
My memory is that the kernel normally copies the LDT when creating
a new task but I might be wrong.
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|