From: Nicholas N. <nj...@ca...> - 2004-08-06 12:38:21
|
Greetings,

A lot of my recent commits have been aimed at simplifying and compacting Valgrind's code. Some of this has been simple cleaning and refactoring, eg. my clean-up of all the cruft that had accumulated in vg_include.h. Other parts required a bit more insight into how things worked, eg. my changes to the way memory is allocated.

That's all well and good, but there's one part of Valgrind that worries me more than any other in this respect -- the unholy triumvirate of vg_scheduler.c, vg_signals.c and vg_proxylwp.c. In my opinion, they form the most complex part of Valgrind, the part that has the most bugs, the part that requires the most maintenance in the face of changes to things outside Valgrind, such as glibc and distro quirks. They also represent some of the worst software engineering aspects, in that they are highly coupled and have fairly low cohesion. They also happen to be the part of Valgrind that I understand least, although that's beside the point.

I've been looking at them (and to a lesser extent, their part-time accomplices vg_libpthread.c and vg_syscalls.c) a bit, in an attempt to find ways to make them smaller, simpler, and/or less coupled. I haven't had much luck. A large part of the problem is that this stuff -- threads, signals and blocking syscalls -- is just really complicated; it's unavoidable that the code dealing with these aspects is going to be nasty in places. But I still think there must be room for improvement. I've certainly been able to improve some other areas, but this part defeated me because it's too big and tangled. For example, my starting point for the allocation improvements was the call-graph for the involved functions; the call-graph for the threads/signals/etc stuff was too big and painful to work out by hand. (BTW, does anyone know any good call-graph drawing programs? I couldn't find any that worked on my machine.)
One critical factor is that all these files spend a lot of time fiddling about with the ThreadState of different threads. ThreadState is a huge, ugly data structure. Large parts of it are unavoidable -- every thread must have a stack, an alt-stack, a clean-up stack, TLS, a sigmask, an ldt, saved reg state, etc -- that's just a consequence of threads being complex in general. But I wonder if other parts could be removed or simplified in some way? Or if fewer pieces of code could be made to touch this structure? Too many pieces of code just jump in and fiddle about with it.

Another factor is that the proxyLWP stuff, while undoubtedly very clever, duplicates a lot of the code in vg_signals.c and vg_scheduler.c. Well, not quite duplicates, but does stuff that is annoyingly similar. Perhaps this situation could be improved? And it's pretty complicated in many ways too; just look at do_pthread_syscall() and the reason for its existence. Yuk.

The current work being done on architecture ports should help clean up and factor out a lot of the stuff on that front. I'm fairly confident architecture abstraction can be done cleanly and without disturbing things too much. Our experience with vg_to_ucode.c and vg_from_ucode.c has shown that the processor stuff, although reasonably complicated, is sufficiently self-contained and unchanging that it doesn't cause many problems. Once it's correct, it tends to stay correct. In comparison, I'm not so confident about the OS ports side, because that stuff has three nasty properties: it's complicated, not neatly contained, and it does change. (This brings to mind fine words from Julian, which I paraphrase: "contained complexity is ok; distributed complexity is death".) It also concerns me every time some obscure bug gets fixed, and the patch adds another 50, 100, or 200 lines of code to Valgrind. Valgrind grows mostly by evolving, like any other open-source project. I'd like to cut some clean paths through the accreted jungle with a machete.

This message has a couple of aims:

- to provoke ideas/discussion about specific improvements to the current vg_scheduler.c/vg_signals.c/vg_proxylwp.c mess;

- to provoke more general, long-term ideas/discussion about modularity, complexity, etc.

I'm not meaning to criticise anyone or their code. I'm not really interested in hearing explanations about why a particular piece of code works the way it is, eg. "feature X is required because of Y and Z", unless they somehow feed into the next point... I am interested in hearing ideas about how things can be improved, eg. "A and B have a lot in common, if we factor the commonality out it will cut 100 lines of code and make C easier to understand", or "I noticed that we no longer need D since we changed E", or "merging F and G means that we can convert H from a global function into a local one", or higher-level things.

Ideas are welcome; let 'em rip.

N |
From: Jeremy F. <je...@go...> - 2004-08-07 22:25:03
|
On Fri, 2004-08-06 at 13:38 +0100, Nicholas Nethercote wrote:

> I've been looking at them (and to a lesser extent, their part-time accomplices vg_libpthread.c and vg_syscalls.c) a bit, in an attempt to find ways to make them smaller, simpler, and/or less coupled. I haven't had much luck. A large part of the problem is that this stuff -- threads, signals and blocking syscalls -- is just really complicated; it's unavoidable that the code dealing with these aspects is going to be nasty in places.

Yup. Valgrind is in the awkward position of having to track the kernel's model of signals, threads and system calls. This is just all very complex. We've made the job more complex for ourselves by insisting that all the user code runs in a single thread, since that introduces the extra complexity of the proxyLWP stuff. One of my design goals was to use as much of the kernel machinery as possible so we don't have to emulate it - that's what the proxyLWPs are for, but it wasn't as successful as I'd hoped for. If we could work out a way to get all the other behaviours we want, running the app threads in their own kernel threads would probably simplify a lot of this because we could just use the kernel directly, rather than having to emulate it quite so much.

> But I still think there must be room for improvement. I've certainly been able to improve some other areas, but this part defeated me because it's too big and tangled. For example, my starting point for the allocation improvements was the call-graph for the involved functions; the call-graph for the threads/signals/etc stuff was too big and painful to work out by hand. (BTW, does anyone know any good call-graph drawing programs? I couldn't find any that worked on my machine.)

> One critical factor is that all these files spend a lot of time fiddling about with the ThreadState of different threads. ThreadState is a huge, ugly data structure. Large parts of it are unavoidable -- every thread must have a stack, an alt-stack, a clean-up stack, TLS, a sigmask, an ldt, saved reg state, etc -- that's just a consequence of threads being complex in general. But I wonder if other parts could be removed or simplified in some way? Or if fewer pieces of code could be made to touch this structure? Too many pieces of code just jump in and fiddle about with it.

Well, they all fiddle with different pieces of it. For notation's sake, it might be worth breaking it into separate sub-structures: scheduler state, pthreads state, syscall/signal state, CPU state. Apart from the pthreads state, I can't see anything which doesn't need to be there. The structure is ~900 bytes, and 512 bytes of that is sse state.

> Another factor is that the proxyLWP stuff, while undoubtedly very clever, duplicates a lot of the code in vg_signals.c and vg_scheduler.c. Well, not quite duplicates, but does stuff that is annoyingly similar. Perhaps this situation could be improved?

Could you be a bit more specific?

> And it's pretty complicated in many ways too; just look at do_pthread_syscall() and the reason for its existence. Yuk.

I would love to get rid of it, but it's basically a workaround for a bug in the Unix API. There's just no way to atomically unblock some signals and run a syscall.

> The current work being done on architecture ports should help clean up and factor out a lot of the stuff on that front. I'm fairly confident architecture abstraction can be done cleanly and without disturbing things too much. Our experience with vg_to_ucode.c and vg_from_ucode.c has shown that the processor stuff, although reasonably complicated, is sufficiently self-contained and unchanging that it doesn't cause many problems. Once it's correct, it tends to stay correct.
> In comparison, I'm not so confident about the OS ports side, because that stuff has three nasty properties: it's complicated, not neatly contained, and it does change.

Yes, that worries me too. There are some things which we can be reasonably sure will be the same across all Unix-like systems (eg, they'll all have open/read/write/close syscalls), but the whole threads/signals stuff could be quite different. And some Unix-like systems, like MacOS X, may have large chunks of very non-Unix behaviour (all the Mach stuff, in MacOS X's case). As you say, CPUs are relatively fixed, well defined and well documented, but OSs are not (and operating environments, ie, including the libraries, are really very variable).

> This message has a couple of aims:
>
> - to provoke ideas/discussion about specific improvements to the current vg_scheduler.c/vg_signals.c/vg_proxylwp.c mess;
>
> - to provoke more general, long-term ideas/discussion about modularity, complexity, etc.
>
> I'm not meaning to criticise anyone or their code.
>
> I'm not really interested in hearing explanations about why a particular piece of code works the way it is, eg. "feature X is required because of Y and Z", unless they somehow feed into the next point...
>
> I am interested in hearing ideas about how things can be improved, eg. "A and B have a lot in common, if we factor the commonality out it will cut 100 lines of code and make C easier to understand", or "I noticed that we no longer need D since we changed E", or "merging F and G means that we can convert H from a global function into a local one", or higher-level things.
>
> Ideas are welcome; let 'em rip.

Well, I think a big obvious first step is to drop all the pthreads support from the core, which will simplify vg_scheduler a fair bit. The scheduler doesn't really need to know much about a thread's state other than whether it's running or blocked on some event (which is probably always syscall completion of some kind).

This means that to do threading, all OS ports must do enough to support the native threads library. The downside is that this could result in as much complexity as dropping core pthreads support removes, and it would be duplicated for each OS port. But it does mean that we don't have to carry libpthread around, which is a large and increasing burden (and that's just for all the x86-linux variants).

Also, one general comment about the refactoring. It's obviously a good idea, and you've done a really good job so far. But we have to be careful that when things get merged together, they're actually semantically the same rather than just coincidentally the same. For example, if Linux and BSD happen to have the same shaped, say, signal delivery structure, that doesn't mean we should use the same structure definition and delivery functions for both. Their similarity is just a coincidence, and we don't want to introduce a false coupling between those implementations.

J |
From: Julian S. <js...@ac...> - 2004-08-09 18:29:43
|
On Saturday 07 August 2004 23:24, Jeremy Fitzhardinge wrote:

> One of my design goals was to use as much of the kernel machinery as possible so we don't have to emulate it - that's what the proxyLWPs are for, but it wasn't as successful as I'd hoped for.

In what way did it fall short of what you hoped to achieve?

> Well, I think a big obvious first step is to drop all the pthreads support from the core, which will simplify vg_scheduler a fair bit. The scheduler doesn't really need to know much about a thread's state other than whether it's running or blocked on some event (which is probably always syscall completion of some kind).

That sounds good to me; I'll vote in favour. When you say that "the scheduler doesn't really need to know much about a thread's state other than whether it's running or blocked on some event" I wonder if that could be done using the discrete-event-framework I suggested in a message half an hour ago. You can think of this as the OS-specific module placing an event in the event queue, which the core understands, to say "thread X is now blocked/runnable".

I guess a good question is, what is the minimal set of thread events the core needs to know about?

* thread creation / destruction
* changes into / out of runnable state

If the core is merely scheduling threads under the direction of a separate OS-support module, does it need to know much more than this? The OS-support module observes and handles all thread-synchronisation events, the net external effect of which is merely to change the runnability status. But hey, I'm merely parroting what you are suggesting anyway.

> This means that to do threading, all OS ports must do enough to support the native threads library. The downside is that this could result in as much complexity as dropping core pthreads support removes, and it would be duplicated for each OS port. But it does mean that we don't have to carry libpthread around, which is a large and increasing burden (and that's just for all the x86-linux variants).

I agree. I would just love to get rid of libpthread.

> Also, one general comment about the refactoring. It's obviously a good idea, and you've done a really good job so far. But we have to be careful that when things get merged together, they're actually semantically the same rather than just coincidentally the same. For example, if Linux and BSD happen to have the same shaped, say, signal delivery structure, that doesn't mean we should use the same structure definition and delivery functions for both. Their similarity is just a coincidence, and we don't want to introduce a false coupling between those implementations.

Sure. Of course we have to be very careful about the semantics, more so than anything else.

J |
From: Jeremy F. <je...@go...> - 2004-08-10 21:30:42
|
On Mon, 2004-08-09 at 19:29 +0100, Julian Seward wrote:

> On Saturday 07 August 2004 23:24, Jeremy Fitzhardinge wrote:
>
> > One of my design goals was to use as much of the kernel machinery as possible so we don't have to emulate it - that's what the proxyLWPs are for, but it wasn't as successful as I'd hoped for.
>
> In what way did it fall short of what you hoped to achieve?

Well, there's still a lot of gunk in vg_signals.c. One problem is needing to support both 2.4 and 2.6 kernels, since their thread and signal models are very different, so there are almost two distinct code paths in there. If we go to "emulate clone" threading, this will get worse, since 2.4 and 2.6 clone are also quite different.

There's also just a lot of careful state management to make sure the right thread has the right signal mask, and the right signal gets delivered to the right place and interrupts the right system call. The proxyLWP stuff, in 2.6, gets the kernel to do signal routing (ie, deciding which thread to hit with the signal), but the rest is still our code.

J |
From: Bob F. <bfr...@si...> - 2004-08-10 22:37:20
|
On Tue, 10 Aug 2004, Jeremy Fitzhardinge wrote: > > Well, there's still a lot of gunk in vg_signals.c. One problem is Since Valgrind is apparently full of "gunk" and as OS support is added/extended, the level of "gunk" may increase, has it been considered to use loadable (or compile-time) driver-like modules to try to make sense of it all? If the interfaces can be defined completely, then it should be possible to support new OS versions without inadvertently breaking older versions. There may be more replicated code across modules, but each module should be simpler and easier to maintain. Thoughts? Bob ====================================== Bob Friesenhahn bfr...@si... http://www.simplesystems.org/users/bfriesen |
From: Eric E. <eri...@fr...> - 2004-08-11 00:51:20
|
Hi guys,

I'm looking at this interesting discussion, and wonder if it would not be possible to do the following (but I'm not completely aware of all V internals):

- Have each user thread run the instrumented code, and perform all the syscalls as-is (of course wrapped, but with no delegation to another thread).

- Only memory allocations/freeing are a bit synchronized, through thread-safe wrapping allocators, which initialize the A and initial V bits. There is no write concurrency on the A bits, and all threads read them without locking.

- All memory accesses are considered non-concurrent, i.e. when a write is made, we can set the memory value and update the 1-1 corresponding V bits. Testing them (except for helgrind) can be done without particular synchronization.

- Atomic instructions are jit'ed to atomic instructions; the tests/updates of the V bits do not need to be serialized; only helgrind will handle the translations of these instructions a bit differently.

- Signals are handled normally. When a syscall blocks in the kernel, that's ok, it blocks the current thread. Of course, all POSIX signal handling functions (or syscalls) may need to be wrapped, so V knows which threads are interested in which signals, and may update some internal stuff in sig handler wrappers (like setting a flag telling that the thread is in a sig handler and must not do certain calls...).

- When a user thread requires some code to be instrumented, it synchronizes with other threads, so that only one performs the instrumentation of that chunk of code.

- Error logging and code instrumentation requests are done through a very restricted set of thread-safe calls, which perform adequate locking and optionally delegate that to a dedicated non-user thread, which can have its own signal mask, so as not to interfere with user threads.

This seems fairly simple, and (if I'm not wrong and it works...) would ensure that:

- applications behave exactly as if they were not instrumented
- it simplifies the signal handling stuff a lot
- it simplifies the thread structure a lot
- it removes syscall delegation
- there are few false errors (if we update the V bits after the memory) or few missed errors (if updated before), and only for threaded processes
- instrumented apps will certainly run really faster, due to fewer (in fact none) context switches when doing syscalls
- most valgrind code can be written non-thread-safe, as can most tool code, since they are protected by the restricted thread-safe APIs
- it should work on all platforms...

Just tell me if I missed something, or if I should review my unix manuals...

Cheers
-- Eric |
From: Jeremy F. <je...@go...> - 2004-08-11 01:25:13
|
On Wed, 2004-08-11 at 02:51 +0200, Eric Estievenart wrote: > Hi guys, > I'm looking at that interesting discussion, and wonder if it is not > possible to do the following thing (but I'm not completely aware > of all V internals): > > - Have each user thread run the instrumented code, and perform all > the syscalls as-is (of course wrapped, but with no delegation > to other thread) > > - Only memory allocations/freeing are a bit synchronized, > through thread-safe wrapping allocators, which initialize > the A and initial V bits. > There is no write concurrency on the A bits, > and all threads read them without locking. This would mean that the RMW operations on the Tool's metadata would not be atomic with respect to other threads. This means that your results would be non-deterministic if there is contention. You could argue that any code which has non-interlocked shared memory accesses is not deterministic anyway, but on a UP system, RMW instructions are atomic with respect to interrupts, which is not something we can enforce after they've been through Valgrind's instruction wringer. On SMP systems you need the LOCK prefix to get atomic operations, but we don't really implement it (though we could). > - All memory access are considered non-concurrent, i.e. > when a write is made, we can set the memory value and update the > 1-1 corresponding V bits. Testing them (except for helgrind) > can be done without particular synchronization. Why? > - Atomic instructions are jit'ed to atomic instructions; > the test/updates of the V bits does not need to be serialized, > only helgrind will handle the translations of these instructions > a bit differently. Why wouldn't V state need serializing? > - Signals are handled normally. > When a syscall blocks in the kernel, that's ok, it blocks > the current thread. 
> Of course, all POSIX signal handling functions (or syscalls) may need to be wrapped, so V knows which threads are interested in which signals, and may update some internal stuff in sig handler wrappers (like setting a flag telling that the thread is in a sig handler and must not do certain calls...)

That's hard to determine in general - you can't easily tell if a process has left the handler with longjmp.

> - When a user thread requires some code to be instrumented, it synchronizes with other threads, so that only one performs the instrumentation of that chunk of code.

I think we can put a big fat lock around the whole inside of the core without much problem (ie, all the translation machinery).

> - Error logging and code instrumentation requests are done through a very restricted set of thread-safe calls, which perform adequate locking and optionally delegate that to a dedicated non-user thread, which can have its own signal mask, so as not to interfere with user threads.

Error reporting is already done through a fairly standard interface.

> This seems fairly simple, and (if I'm not wrong and it works...) would ensure that:
> - applications behave exactly as if they were not instrumented

Hm. All RMW instructions are atomic on a single-processor machine, so any code which assumes this would break unless we're careful to maintain that property - which could be expensive.

> - simplifies a lot signal handling stuff
> - simplifies a lot the thread structure
> - removes syscall delegation
> - ensure that there are few false errors (if we update the V bits after the memory) or few missed errors (if updated before), and only for threaded processes

This is the biggest concern - these false/missed reports would be very non-deterministic. Basically, it would mean that a tool's model of the state of the system would start to drift away from the real state of the system, in a non-deterministic way. For something like memcheck, where V state is copied around, it could mean there's a large explosion in state drift.

> - instrumented apps will certainly run really faster, due to fewer (in fact none) context switches when doing syscalls

It probably isn't a good idea to make sweeping statements like that - Valgrind has been very good at behaving in very unexpected ways performance-wise. It isn't likely to hurt performance though.

> - most valgrind code can be written non thread-safe, as for most tools code, since they are protected by the restricted thread-safe apis

Anything which is dealing with instrumented code will need to be thread-aware.

> - it should work on all platforms...

Um, maybe.

J |
From: Robert W. <rj...@du...> - 2004-08-08 02:30:34
|
> (BTW, does anyone know any good call-graph drawing programs? I couldn't find any that worked on my machine.)

If you're looking for just a generic graph-drawing package, then this is your man:

http://www.research.att.com/sw/tools/graphviz/

It's got an open source license, but I haven't looked at it very carefully, so I don't know the exact terms. However, I can attest to the results - they're really nice. We use them at work for plotting flows through our scientific codes.

Regards,
Robert.

--
Robert Walsh
Amalgamated Durables, Inc. - "We don't make the things you buy."
Email: rj...@du... |
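[For reference, graphviz input is just a text file listing edges. A minimal call-graph file -- the function names are taken from elsewhere in this thread, but the edges are invented for illustration:]

```dot
/* calls.dot -- render with: dot -Tps calls.dot -o calls.ps */
digraph calls {
    rankdir=LR;
    "VG_(scheduler)"    -> "do_syscall";
    "VG_(scheduler)"    -> "VG_(deliver_signal)";
    "VG_(proxy_create)" -> "VG_(deliver_signal)";
}
```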
From: Bob F. <bfr...@si...> - 2004-08-08 04:17:44
|
I obtained a script from somewhere which uses gprof output to generate a weighted call-graph using graphviz. Bob On Sat, 7 Aug 2004, Robert Walsh wrote: >> (BTW, does anyone know any good call-graph drawing >> programs? I couldn't find any that worked on my machine.) > > If you're looking for just a generic graph-drawing package, then this is > your man: > > http://www.research.att.com/sw/tools/graphviz/ > > It's got an open source license, but I haven't looked at it very > carefully, so I don't know the exact terms. However, I can attest to > the results - they're really nice. We use them at work for plotting > flows through our scientific codes. > > Regards, > Robert. > > -- > Robert Walsh > Amalgamated Durables, Inc. - "We don't make the things you buy." > Email: rj...@du... > ====================================== Bob Friesenhahn bfr...@si... http://www.simplesystems.org/users/bfriesen |
From: Nicholas N. <nj...@ca...> - 2004-08-09 13:52:02
|
On Sat, 7 Aug 2004, Bob Friesenhahn wrote: > I obtained a script from somewhere which uses gprof output to generate a > weighted call-graph using graphviz. Bob Thanks for the suggestion. Unfortunately, Valgrind and gprof don't interact very happily, ie. Valgrind tends to crash when even parts of it are compiled with -pg. A call-graph extractor that works statically on the source code would be better... N |
From: Bob F. <bfr...@si...> - 2004-08-09 15:19:24
|
On Mon, 9 Aug 2004, Nicholas Nethercote wrote:

> On Sat, 7 Aug 2004, Bob Friesenhahn wrote:
>
> > I obtained a script from somewhere which uses gprof output to generate a weighted call-graph using graphviz. Bob
>
> Thanks for the suggestion. Unfortunately, Valgrind and gprof don't interact very happily, ie. Valgrind tends to crash when even parts of it are compiled with -pg. A call-graph extractor that works statically on the source code would be better...

Using gprof is pretty involved anyway. I was suggesting graphviz as a useful tool which can be used elsewhere by valgrind. The gprof-based tool was just an example.

One way graphviz could be used is to provide a summary of call stacks vs. leaks (or unreleased memory) as an alternative to the regular valgrind memcheck output. This would help the user correlate multiple reports. The figure would show a summary call graph resulting in unreleased memory allocations. In many situations, a common bug can result in many individual reports (e.g. if a C++ class destructor fails to release internal allocations or the destructor is never called). If redundant stack information is eliminated, the resulting digraph would make it very obvious where the memory allocations were made. This would save the user a lot of time, and would be more attractive than the normal valgrind output.

Bob
======================================
Bob Friesenhahn
bfr...@si...
http://www.simplesystems.org/users/bfriesen |
From: Nicholas N. <nj...@ca...> - 2004-08-09 14:11:19
|
On Sat, 7 Aug 2004, Jeremy Fitzhardinge wrote: >> One critical factor is that all these files spend a lot of time fiddling >> about with the ThreadState of different threads. > > Apart from the pthreads state, I can't see anything which doesn't need > to be there. The structure is ~900 bytes, and 512 bytes of that is sse > state. My concern isn't the size in bytes, rather the number of distinct fields. >> Another factor is that the proxyLWP stuff, while undoubtedly very clever, >> duplicates a lot of the code in vg_signals.c and vg_scheduler.c. Well, >> not quite duplicates, but does stuff that is annoyingly similar. Perhaps >> this situation could be improved? > > Could you be a bit more specific? Maybe "duplicates" is not quite the right word... a lot of functions are called both from vg_proxylwp.c and vg_scheduler.c. A couple of examples: - VG_(do_signal_routing)() - VG_(deliver_signal)() - VG_(is_sig_ign)() - VG_(proxy_create)() - VG_(proxy_set_sigmask)() etc. I'm not passing judgment on these particular examples, they're just ones I saw. I guess what I was trying to say: there seem to be a number of actions whereby vg_scheduler.c or vg_signals.c handles the non-blocking-syscall case, and vg_proxylwp.c handles the blocking syscall case, and the two cases are similar but not identical. do_syscall/do_pthread_syscall is one example. ---- > One of my design goals was to use as much of the kernel machinery as > possible so we don't have to emulate it - that's what the proxyLWPs are > for, but it wasn't as successful as I'd hoped for. If we could work out > a way to get all the other behaviours we want, running the app threads > in their own kernel threads would probably simplify a lot of this > because we could just use the kernel directly, rather than having to > emulate it quite so much. > > [...] > > Well, I think a big obvious first step is to drop all the pthreads > support from the core, which will simplify vg_scheduler a fair bit. 
The > scheduler doesn't really need to know much about a thread's state other > than its running or blocked on some event (which is probably always > syscall completion of some kind). > > This means that to do threading, all OS ports must do enough to support > the native threads library. The downside is that this could result in > as much complexity as dropping core ptheads support removes, and it > would be duplicated for each OS port. But it does mean that we don't > have to carry libpthread around, which is a large and increasing burden > (and that's just for all the x86-linux variants). Are these two points (above and below the "[...]") the same? Ok, maybe this is the direction we should be heading in, then. So, what's preventing us from doing so? AFAIK: - We can't lose control of scheduling totally, because V turns atomic instructions into non-atomic ones. We don't want a thread switch after a shadow memory word is written, but before the word itself is written. - We need a way of identifying when certain interesting thread/scheduling operations occur, such as mutex locks, thread switches, etc, to support Helgrind, Calltree, and future tools. Do we have solutions for these, and are there any other problems? N |
From: Julian S. <js...@ac...> - 2004-08-09 18:43:43
|
> > Well, I think a big obvious first step is to drop all the pthreads > > support from the core, which will simplify vg_scheduler a fair bit. The > > scheduler doesn't really need to know much about a thread's state other > > than its running or blocked on some event (which is probably always > > syscall completion of some kind). > > > > This means that to do threading, all OS ports must do enough to support > > the native threads library. The downside is that this could result in > > as much complexity as dropping core ptheads support removes, and it > > would be duplicated for each OS port. But it does mean that we don't > > have to carry libpthread around, which is a large and increasing burden > > (and that's just for all the x86-linux variants). > Ok, maybe this is the direction we should be heading in, then. So, what's > preventing us from doing so? AFAIK: > > - We can't lose control of scheduling totally, because V turns atomic > instructions into non-atomic ones. We don't want a thread switch after > a shadow memory word is written, but before the word itself is written. Agreed. So the suggested solution is to maintain single-threadedness but instead of imposing by intercepting and simulating the pthreads interface, we intercept and simulate at the clone() level. As Jeremy says it would be best to run app threads in kernel threads, iow allow genuinely concurrent instrumented execution. I can't say that I believe the problems that raises, with locking the shadow state, at reasonable performance overhead, are soluble -- although I would love to be proved wrong. So for the moment, my view is we still have to impose sequential execution, but lower the level at which this is done. > - We need a way of identifying when certain interesting thread/scheduling > operations occur, such as mutex locks, thread switches, etc, to support > Helgrind, Calltree, and future tools. Yes. 
What is needed is to intercept the relevant functions (pth_mutex_lock etc), note they have happened, but still allow the native thread library to handle it. The downside is we'd then need at least some simulation of the pthreads machinery in order to detect errors, but the good thing is we can make it as weedy as we like, commensurate with the errors we wish to detect. By contrast, currently our pthreads simulation has to be accurate enough to actually make the program work correctly, and we can't weaken it in any way.

At least in theory, if people want to run different threading schemes (foo_threads, for example) we will then be able to at least run a foo-threaded program, so long as it is underlyingly based on clone().

J |
From: Tom H. <th...@cy...> - 2004-08-09 23:15:26
|
In message <200...@ac...> Julian Seward <js...@ac...> wrote:

> > Ok, maybe this is the direction we should be heading in, then. So, what's preventing us from doing so? AFAIK:
> >
> > - We can't lose control of scheduling totally, because V turns atomic instructions into non-atomic ones. We don't want a thread switch after a shadow memory word is written, but before the word itself is written.
>
> Agreed. So the suggested solution is to maintain single-threadedness, but instead of imposing it by intercepting and simulating the pthreads interface, we intercept and simulate at the clone() level.

I certainly think this is a good aim, as it would allow us to drop all of vg_libpthread.c (a really good idea) and much of vg_scheduler.c as well, I suspect. In fact we had this working once for wine support, because wine created threads using clone directly. My colleague Adam Gundy had a solution that involved changing run_thread_for_a_while so that if the thread to run was not the executing one, then the executing thread would send a signal to the thread to be run and then call sigsuspend, so that it blocked until such time as the scheduler chose to run it and woke it up with a signal. We dropped that from the wine stuff after wine implemented an alternate pthread based solution, but I got Adam to dig it out again last Friday after Nick started this thread, and am looking at whether it can be resurrected.

> As Jeremy says it would be best to run app threads in kernel threads, iow allow genuinely concurrent instrumented execution. I can't say that I believe the problems that raises, with locking the shadow state, at reasonable performance overhead, are soluble -- although I would love to be proved wrong.

We have discussed it before, but I suspect it would be very hard.

> Yes. What is needed is to intercept the relevant functions (pth_mutex_lock etc), note they have happened, but still allow the native thread library to handle it.
> The downside is we'd then need at least some simulation of the pthreads machinery in order to detect errors, but the good thing is we can make it as weedy as we like, commensurate with the errors we wish to detect. By contrast, currently our pthreads simulation has to be accurate enough to actually make the program work correctly, and we can't weaken it in any way.

This is why I'm interested in the function-wrapping stuff that I think somebody is working on, which would hopefully allow valgrind to do some processing on entry to/exit from a function in the client program. I would be inclined to consider dropping all the pthread error checking from the core and pushing it into a pthread validation tool. Obviously helgrind also needs to notice lock/unlock events.

Tom

--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/ |
From: Bryan O'S. <bo...@se...> - 2004-08-09 23:25:31
|
On Mon, 2004-08-09 at 23:59 +0100, Tom Hughes wrote:

> My understanding is that this was looked at quite carefully during the NPTL development. In fact the original plan was to go to a multi-level model, but from what I've read, experiment showed that the kernel could be made to scale to vast numbers of threads/processes without any loss of performance.

I think you're thinking of NGPT, IBM's parallel effort to replace LinuxThreads, which ended up being abandoned.

<b |
From: Tom H. <th...@cy...> - 2004-08-10 06:10:32
|
In message <109...@se...> Bryan O'Sullivan <bo...@se...> wrote:

> On Mon, 2004-08-09 at 23:59 +0100, Tom Hughes wrote:
>
> > My understanding is that this was looked at quite carefully during the NPTL development. In fact the original plan was to go to a multi-level model, but from what I've read, experiment showed that the kernel could be made to scale to vast numbers of threads/processes without any loss of performance.
>
> I think you're thinking of NGPT, IBM's parallel effort to replace LinuxThreads, which ended up being abandoned.

The paper I was thinking of is about NPTL, but some of the experience it talks about did come from NGPT, I think. The paper is at:

http://people.redhat.com/drepper/nptl-design.pdf

Tom

--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/ |
From: Jeremy F. <je...@go...> - 2004-08-10 07:46:18
|
On Mon, 2004-08-09 at 15:11 +0100, Nicholas Nethercote wrote:

> Maybe "duplicates" is not quite the right word... a lot of functions are called both from vg_proxylwp.c and vg_scheduler.c. A couple of examples:
>
> - VG_(do_signal_routing)()
> - VG_(deliver_signal)()
> - VG_(is_sig_ign)()
> - VG_(proxy_create)()
> - VG_(proxy_set_sigmask)()
>
> etc. I'm not passing judgment on these particular examples, they're just ones I saw.
>
> I guess what I was trying to say: there seem to be a number of actions whereby vg_scheduler.c or vg_signals.c handles the non-blocking-syscall case, and vg_proxylwp.c handles the blocking syscall case, and the two cases are similar but not identical. do_syscall/do_pthread_syscall is one example.
>
> ----
>
> > One of my design goals was to use as much of the kernel machinery as possible so we don't have to emulate it - that's what the proxyLWPs are for, but it wasn't as successful as I'd hoped for. If we could work out a way to get all the other behaviours we want, running the app threads in their own kernel threads would probably simplify a lot of this because we could just use the kernel directly, rather than having to emulate it quite so much.
> >
> > [...]
> >
> > Well, I think a big obvious first step is to drop all the pthreads support from the core, which will simplify vg_scheduler a fair bit. The scheduler doesn't really need to know much about a thread's state other than it's running or blocked on some event (which is probably always syscall completion of some kind).
> >
> > This means that to do threading, all OS ports must do enough to support the native threads library. The downside is that this could result in as much complexity as dropping core pthreads support removes, and it would be duplicated for each OS port.
> > But it does mean that we don't have to carry libpthread around, which is a large and increasing burden (and that's just for all the x86-linux variants).
>
> Are these two points (above and below the "[...]") the same?

The two points are orthogonal. The first is about internal implementation, and the second is about the interface we present.

The first point is about whether we use kernel thread contexts directly for client thread contexts. Ie, do we do our own thread switching, or let the kernel do it? The second is whether we present threading to the clients in terms of pthreads, or the kernel's underlying thread abstraction (clone, in Linux's case).

> Ok, maybe this is the direction we should be heading in, then. So, what's preventing us from doing so? AFAIK:
>
> - We can't lose control of scheduling totally, because V turns atomic instructions into non-atomic ones. We don't want a thread switch after a shadow memory word is written, but before the word itself is written.

Well, the only way to do this is either by putting locks around all the critical sections, or by using lock-free algorithms. The first is expensive, and the second is v. complex in general. I've been thinking about some scheme for doing adaptive locking, where only memory accesses which end up touching memory shared between threads get locking, but most unshared memory accesses are unlocked. It's tricky, and not necessarily a performance win.

> - We need a way of identifying when certain interesting thread/scheduling operations occur, such as mutex locks, thread switches, etc, to support Helgrind, Calltree, and future tools.

I think we can do this by hooking into library function calls and observing the client behaviour. For example, if we see the client call pthread_mutex_lock, and then that thread ends up blocking in sys_futex before returning, we can deduce that it blocked in a lock. It could get fragile though.
Or we could maintain a parallel model of the libpthread state, but that also seems fragility-prone. J |
From: Nicholas N. <nj...@ca...> - 2004-08-09 15:53:15
|
On Mon, 9 Aug 2004, Josef Weidendorfer wrote:

> Check out Calltree from Joerg Schilling (as that's older, I will rename my Valgrind tool to Callgrind with the next release); it gives a static call graph for C code, and is able to produce GraphViz format.

Ah, got it, and it works! Thanks, Josef.

N |
From: Julian S. <js...@ac...> - 2004-08-09 17:44:34
|
> I am interested in hearing ideas about how things can be improved, eg. "A and B have a lot in common, if we factor the commonality out it will cut 100 lines of code and make C easier to understand", or "I noticed that we no longer need D since we changed E", or "merging F and G means that we can convert H from a global function into a local one", or higher-level things.

I think cleaning up the signals/syscalls/threads stuff -- what I think of as the "environment simulation" -- is hugely important, and I'm pleased to see you looking into it. As you say, the processor emulation stuff is pretty much standard compiler technology and pretty much self-contained, with a bit of care, so the difficulties in it are manageable.

I have no silver bullet (otherwise I would have deployed it by now :) but some comments:

* One interesting stepping stone which would help explore the design space is to consider if/how it is possible to run the native thread library (nptl, I guess) directly. Instead of intercepting and scheduling at the pthread level, we intercept clone() and schedule the results ourselves. Does that help? Does it help move towards a more modular framework? At least it would get rid of our libpthread, which has to be a good thing.

* More generally, I wonder if it helps to try and structure valgrind as a discrete event simulator. Note, this idea is very half-baked and may be utterly irrelevant.

A discrete event simulator has a state, and a time-ordered queue of future events. The basic cycle is to take the next-to-happen event off the queue and do it, which changes the state and/or possibly generates new future events which are enqueued. This is standard stuff in the chip-simulation etc community.

So, using C++ inheritance or whatever, you could have some core set of events, with some core state, and each OS could have its own private state and events pertaining to it. The event queue would hold a mixture of the two.
Valgrind's core would pull events off the queue, handle the ones it understands directly, and call the relevant OS-specific handler for OS-specific events it doesn't understand. The kinds of events that would be in the queue might be:

* deliver a signal
* signal delivery done (if that makes sense, I suspect not)
* mess with host signals in some way
* check to see if some event has happened by some specific time
* syscall initiation
* syscall termination
* run a valgrind thread for a while
* low-level synchronisation events between V-scheduled threads

The aim is to decouple the V core as much as possible from the env-simulation specifics. For example, imagine we want to valgrindify a raw ARM image (on x86 host, perhaps) which expects to run in some low-level environment, possibly without an OS at all. Then the events it wants to do are nothing at all to do with the Unix syscalls/signals/threads model, and so support for the Unix model shouldn't be hardwired into the valgrind core. Ideally, supporting that low-level environment would be done by defining a set of events for it, and giving handlers which "do" those events. Of course the events are opaque to the V core, but that's a good thing.

Such a simulator scheme may not be a good way to structure it, but (1) I haven't heard any other suggestions, and (2) it feels to me like it has a kind of generality which is useful for modularising things. So I'd be interested to see if/how far it can be developed. Comments welcomed.

Re-reading it, what I am suggesting is an abstraction framework based explicitly on the notions of state and events, and around inheritance/opacity, by which details of events specific to a particular environment (simulation) can be localised in a module supporting that environment, instead of being scattered across the entire code base. Wow, that's a long sentence.

* Richard Gabriel ("Patterns of Software") wrote an essay called "Worse is better", and I think we're in that kind of situation now.
The current syscall/signal/thread scheme has been refined sufficiently that it works pretty well on Linux. But when I originally started on it (esp the threads stuff) I had no idea if it would succeed. So we should be prepared to consider ideas which look promising but for which not all the details are worked out ahead of time. Hence the result may be worse for a while, but I see no other way of getting a better understanding of the design space. J |
From: Nicholas N. <nj...@ca...> - 2004-08-09 19:59:03
|
On Mon, 9 Aug 2004, Julian Seward wrote:

> * Richard Gabriel ("Patterns of Software") wrote an essay called "Worse is better", and I think we're in that kind of situation now.

I've been thinking about this a bit w.r.t. full virtualisation. The benefits of FV over the LD_PRELOAD approach are:

1. Valgrind gains control at absolute startup
2. Valgrind can use the standard libraries itself
3. Valgrind can run static binaries
4. Valgrind is nicely separated from the client

All these are good, but they don't to me seem fantastic -- none of them were really huge problems previously. And FV has introduced its own difficulties, mostly from the rigidity of the memory layout, which we're still battling. Plus it results in more code.

I'm not saying we shouldn't have done FV. Because, on the other hand, one of the best things about Valgrind from a user's point of view is the "it just works" factor. We really want to keep that as high as possible, so that's kind of an argument against the worse-is-better approach.

So what's my point? I'm not sure.

N |
From: Jeremy F. <je...@go...> - 2004-08-10 21:04:10
|
On Mon, 2004-08-09 at 18:44 +0100, Julian Seward wrote:

> Such a simulator scheme may not be a good way to structure it, but (1) I haven't heard any other suggestions, and (2) it feels to me like it has a kind of generality which is useful for modularising things. So I'd be interested to see if/how far it can be developed. Comments welcomed.

I would argue that this is already there. The interface between vg_scheduler and the rest of the OS-interface code is already pretty thin, and could be cleaned up pretty easily (this is assuming you ignore the pthreads stuff, which is a lot of the apparent complexity). The scheduler loop is basically:

    for (ever)
        for (each thread)
            if runnable
                run it
        if (nothing ran)
            idle()

And idle() is responsible for collecting async events from elsewhere. These events are:

* timeout - not really used for anything anymore, except pthread_cond_timedwait
* signal polling - not really an event, just something which needs to happen periodically on 2.4 kernels
* kernel event - a thread woke up from being blocked in a syscall, either because it was interrupted by a signal, or because the syscall completed normally

idle(), and vg_scheduler.c, doesn't do anything much with this info - it just uses it as a hint to rescan the thread list and find something runnable. Most of the work (which is the part which has Nick worried, and definitely needs cleaning up) is done by things which directly change the thread state in response to the event. For syscalls, it's just "keep running normally"; for signals, it loads a whole new CPU context to run the signal handler.

If we move all the CPU-specific stuff out of vg_scheduler.c (the actual context-switching machinery) and drop all the libpthread stuff, then vg_scheduler will essentially be just this event loop, which is nice and simple.

Well, the other part is dealing with client requests coming out of the running client instruction stream. This is also a pretty thin interface.
If it's a client request, then do it; if it's a syscall, then just call the arch/OS-specific "do syscall" routine, saying "this thread N has a syscall set up in its context, go do it"; vg_scheduler itself doesn't need to know any other details.

Now one thing which particularly worries me is the number of places which touch ThreadState->status directly, mostly because there are so many states. Every place which changes the state needs to make sure that it's doing a valid transition in the state machine, and since I don't think there's a formal state machine for the thread state, that's hard to do, and hard to verify.

One of the big advantages to me of dropping the pthreads stuff is that we can drop all those ThreadStates associated with them, which will make the state machine much more manageable. Then we can check all the places where the thread state changes and make sure they're valid transitions. This would naturally fix some of the bugs we currently have in the state machine (like not being able to deal sensibly with a pthread_mutex_lock being interrupted by a signal and running a signal handler).

J |
From: Adun R. <adu...@ml...> - 2004-08-09 19:04:31
|
On Mon, 9 Aug 2004 18:44:25 +0100, "Julian Seward" <js...@ac...> said:

> So, using C++ inheritance or whatever, you could have some core set of events, with some core state, and each OS could have its own private state and events pertaining to it. The event queue would hold a mixture of the two. Valgrind's core would pull events off the queue, handle the ones it understands directly, and call the relevant OS-specific handler for OS-specific events it doesn't understand.

This may not be the time, as I am still learning the subject, but since you mentioned it as a design issue, and as this thread seems to deal with important development steps, I wanted to present a thought of mine: porting valgrind to windows.

Several problems arise with this [and many more will arise, as this idea is also, in your words, 'half-baked'], and a lot of code would have to change. First of all, our loading method: there is no LD_PRELOAD [except for the user32.dll attachment dlls in the registry, but this isn't nearly acceptable], and nothing that is like it. A solution could be to control the client process using the Windows API and the Debugging API, which affects the possibility of "capturing" symbols such as malloc, calloc, etc (we can change IAT pointers). Another problem is a whole different threading model.

But first things first - what are your thoughts on this? More importantly, how does this relate to how you all see the valgrind project?

Regards, Rauch Adun.

--
Adun R. adu...@ml... |
From: Nicholas N. <nj...@ca...> - 2004-08-09 19:45:42
|
On Mon, 9 Aug 2004, Adun R. wrote:

> This may not be the time as I am still learning the subject, but since you mentioned it as a design issue, and as this thread seems to deal with important development steps, I wanted to present a thought of mine - porting valgrind to windows. [...] First of all, our loading method: there is no LD_PRELOAD, and nothing that is like it. A solution could be to control the client process using the Windows API and the Debugging API, which affects the possibility of "capturing" symbols such as malloc, calloc, etc (we can change IAT pointers). Another problem is a whole different threading model.

Note that since 2.1.1, the LD_PRELOAD method is no longer used.

My opinion on Windows is that it would be great, if it could be done in a way that doesn't result in Valgrind's code looking like a train crash. AFAIK, none of the current developers know much about Windows; I sure don't.

N |
From: Bob F. <bfr...@si...> - 2004-08-09 19:19:56
|
On Mon, 9 Aug 2004, Julian Seward wrote:

> At least in theory, if people then want to run different threading schemes (foo_threads, for example) we will then be able to at least run a foo-threaded program, so long as it is underlyingly based on clone().

If Linux eventually wakes up and adopts a multi-level threading model (a combination of user-level and on-demand kernel-level threads, "LWP"s, as used by Solaris, AIX, and Digital Unix), this approach is surely doomed to fail. Also, if someone wants to port Valgrind to a Unix OS that doesn't support clone() (only Linux has clone()), it seems doomed to fail.

If Linux is to ever become formidable in the multi-threading world (which it must!) then it will need to eliminate the requirement for the kernel to know about each and every thread. The latest trend is for CPUs to incorporate multiple execution units, so even systems delivered with one CPU chip will benefit performance-wise from multi-threaded programs.

Bob

======================================
Bob Friesenhahn bfr...@si...
http://www.simplesystems.org/users/bfriesen |
From: Julian S. <js...@ac...> - 2004-08-09 22:24:54
|
> I've been thinking about this a bit w.r.t. full virtualisation. The benefits of FV over the LD_PRELOAD approach are:
>
> 1. Valgrind gains control at absolute startup
> 2. Valgrind can use the standard libraries itself
> 3. Valgrind can run static binaries
> 4. Valgrind is nicely separated from the client
>
> All these are good, but they don't to me seem fantastic -- none of them were really huge problems previously. And FV has introduced its own difficulties, mostly from the rigidity of the memory layout, which we're still battling. Plus it results in more code.

I agree that 2, 3 and 4 are not a big deal, but 1 is. The threading stuff was getting nasty for lack of 1 -- having the threading machinery start up in an unknown state is not good, so FV is a big win there. In the longer term 2 and 4 are useful too. For example, if we wanted to use C++ in V in a big way, 2 would be more or less mandatory and 4 would be very helpful too.

> I'm not saying we shouldn't have done FV. Because, on the other hand, one of the best things about Valgrind from a user's point of view is the "it just works" factor. We really want to keep that as high as possible, so that's kind of an argument against the worse-is-better approach.

Re FV, and perhaps the proxyLWP stuff, Jeremy argues that we should use as much of the underlying OS's services as possible, and make V be the thinnest possible layer in between. I've never been entirely comfortable with that philosophy, although I'm not sure I know why. I'd still like to know the results of having our own software-managed MMU and so decoupling the two address spaces completely -- even though, as Jeremy has accurately pointed out several times in the past, that brings its own set of difficulties. I suppose my underlying inclination is to simulate as much stuff as possible and thereby minimise the extent to which we are exposed to the vagaries of the host OS. I think that's why the thin-layer idea is a bit irksome.
Perhaps it's a way of thinking about OS-level portability: either simulate a lot of stuff yourself and minimise the "passed-through" requirements, or have a separate thin-layer arrangement for each supported OS. I lean towards the former scheme, but I don't know if it's really viable -- whereas at least Jeremy has demonstrated that his scheme works well, at least for Linux.

J |