From: Josef W. <Jos...@gm...> - 2003-05-26 09:36:41
Hi,

currently I'm thinking a little bit about what would be needed to allow applications running under Valgrind to use processors in parallel. The main goal would be to speed up cache simulation for multithreaded applications, more specifically to first let OpenMP apps (number crunching) run simultaneously. I'm not at all convinced that there will be any benefit/speedup at all on multiple processors, because of a possible need for additional fine-grained communication among the threads.

So perhaps it's simply not worth it. To come to this conclusion faster, I wanted to ask you about the problems you see in this for the Valgrind core framework.

As I see it: all global data structures accessible by multiple threads must either be avoided or locked on access.
* Could the instrumentation engine/translation table be separated for each thread? This would duplicate translations for each thread, but would avoid synchronisation when accessing the translation hash table.
* V's memory allocation functions have to be multithread-aware.
* Signal handling? Is there anything special that I have overlooked?
* What about Valgrind's version of the pthread library? Do you think that it's a big task to make this reentrant-safe? Or could we perhaps even get rid of our own implementation?

Thanks for any answers,
Josef
From: Nicholas N. <nj...@ca...> - 2003-05-26 10:05:25
On Mon, 26 May 2003, Josef Weidendorfer wrote:

> currently I'm thinking a little bit about what would be needed to allow
> applications running under Valgrind to use processors in parallel. The
> main goal would be to speed up cache simulation for multithreaded
> applications, more specifically to first let OpenMP apps (number
> crunching) run simultaneously. I'm not at all convinced that there will
> be any benefit/speedup at all on multiple processors, because of a
> possible need for additional fine-grained communication among the
> threads.
>
> So perhaps it's simply not worth it. To come to this conclusion faster,
> I wanted to ask you about the problems you see in this for the Valgrind
> core framework.
>
> As I see it: all global data structures accessible by multiple threads
> must either be avoided or locked on access.
> * Could the instrumentation engine/translation table be separated for
>   each thread? This would duplicate translations for each thread, but
>   would avoid synchronisation when accessing the translation hash table.
> * V's memory allocation functions have to be multithread-aware.
> * Signal handling? Is there anything special that I have overlooked?
> * What about Valgrind's version of the pthread library? Do you think
>   that it's a big task to make this reentrant-safe? Or could we perhaps
>   even get rid of our own implementation?

Julian and I have discussed this, and AFAWCT the killer point is shadow memory for Memcheck -- each time memory is written, shadow memory is written a few instructions before. The danger is that if two threads are racing on a memory word, you could get a thread switch in between the shadow write and the real write, and then your shadow memory would not match your real memory.

The problem would be avoided if we could guarantee that thread switches only occur between basic blocks. Actually, now that I think about it, that shouldn't be a problem with the current implementation, since it does thread scheduling itself and never does thread switches in the middle of a basic block. Hmm.

As for getting rid of Valgrind's threads implementation, the best idea so far is to intercept the clone() syscall and have Valgrind schedule threads itself, but not do all the pthreads ops itself. This sounds plausible, but I think the details haven't been worked out. The big advantage would be getting rid of libpthread.so, which is a source of much complexity. The disadvantage would be that the pthreads API error checking would disappear (I think).

All the other stuff you mention about making Valgrind thread-safe seems like it shouldn't be too hard to do (famous last words...)

All this is very Linux-centric, and a (very) long-term goal would be to support other operating systems, but it's not at all clear how to do this.

Does that answer some of your questions?

N
From: Josef W. <Jos...@gm...> - 2003-05-26 11:52:25
On Monday 26 May 2003 12:05, Nicholas Nethercote wrote:
> [...]
> Julian and I have discussed this, and AFAWCT the killer point is shadow
> memory for Memcheck -- each time memory is written, shadow memory is
> written a few instructions before. The danger is that if two threads are
> racing on a memory word, you could get a thread switch in between the
> shadow write and the real write, and then your shadow memory would not
> match your real memory.

Sidenote (perhaps there's a misunderstanding): I was thinking about 2 threads running on 2 processors in parallel. There does not have to be any thread switching for races to occur, as without locking, interleaved memory accesses from the two threads/processors can happen in any order.

As memory is a shared resource, we must use locking (atomic access to shadow and real memory). How fine-grained should the locking be? One lock for all memory accesses kills performance. Perhaps one lock for each allocated area, or one for every 64 bytes of memory? This would need very fast user-mode mutexes (e.g. futexes from Linux 2.5).

But all this is about problems in some specific skin. I currently wonder more about problems with the generic Valgrind core (binary translation engine and runtime environment).

> The problem would be avoided if we could guarantee that thread switches
> only occur between basic blocks. Actually, now that I think about it,
> that shouldn't be a problem with the current implementation since it does
> thread scheduling itself, and never does thread switches in the middle of
> a basic block. Hmm.
>
> As for getting rid of Valgrind's threads implementation, the best idea so
> far is to intercept the clone() syscall, and have Valgrind schedule
> threads itself but not do all the pthreads ops itself. This sounds

But with the scenario of kernel-level threads, Valgrind can't schedule them ?!

> plausible, but I think the details haven't been worked out. The big
> advantage would be getting rid of libpthread.so which is a source of much
> complexity. The disadvantage would be that the pthreads API error
> checking would disappear (I think).

OK. That's useful in general. And the full pthreads replacement could be moved to a skin.

> All the other stuff you mention about making Valgrind thread-safe seems
> like it shouldn't be too hard to do (famous last words...)
>
> All this is very Linux-centric, and a (very) long-term goal would be to
> support other operating systems, but it's not at all clear how to do this.

What are the most problematic Linux-centric things to get rid of? I thought a bigger problem would be supporting other processor architectures. This seems an interesting goal to me, as for skins the architecture is already hidden by using the RISC-like UCode.

> Does that answer some of your questions?

Thanks very much for your response. At least the dimension of this task is somewhat clearer to me now.

Josef
From: Nicholas N. <nj...@ca...> - 2003-05-26 12:06:53
On Mon, 26 May 2003, Josef Weidendorfer wrote:
> Sidenote (perhaps there's a misunderstanding): I was thinking about 2
> threads running on 2 processors in parallel. There does not have to be
> any thread switching for races to occur, as without locking, interleaved
> memory accesses from the two threads/processors can happen in any order.

Oh, good point. That's even more difficult.

> As memory is a shared resource, we must use locking (atomic access to
> shadow and real memory). How fine-grained should the locking be? One lock
> for all memory accesses kills performance. Perhaps one lock for each
> allocated area, or one for every 64 bytes of memory? This would need very
> fast user-mode mutexes (e.g. futexes from Linux 2.5).
> But all this is about problems in some specific skin. I currently wonder
> more about problems with the generic Valgrind core (binary translation
> engine and runtime environment).

Well, programs typically spend something like 80%+ of their time running generated code. So I think you could be quite crude with locking in the core and not suffer too much. But I could be wrong about that. Hmm... I guess you'd want one copy of VG_(innerloop) per thread on a multi-processor machine or performance would suck. Ach, this is getting a bit complex for me...

> What are the most problematic Linux-centric things to get rid of?

Threads, syscalls, signals -- in particular the interactions of all of them.

> I thought a bigger problem would be supporting other processor
> architectures. This seems an interesting goal to me, as for skins the
> architecture is already hidden by using the RISC-like UCode.

Firstly, UCode looks platform-independent but it's really tied quite closely to x86 -- consider the handling of condition codes, the fact that it's two-address, the way FPU/MMX/SSE instructions are handled, etc. Nonetheless, X-arch translation is a definite possibility. It's been done before by others, so the basic ideas are fairly well worked out.

Julian has had some thoughts about this; the difficult part is allowing arbitrary instrumentation in a way that is independent of the architecture. Most instructions are fine (adds, movs, things like that), but every architecture has its own weirdo instructions, and these are the difficult parts to handle.

N
From: Adam G. <ar...@cy...> - 2003-05-27 09:56:10
At 13:57 26/05/2003 +0200, Josef Weidendorfer wrote:
>On Monday 26 May 2003 12:05, Nicholas Nethercote wrote:
>> [...]
>> Julian and I have discussed this, and AFAWCT the killer point is shadow
>> memory for Memcheck -- each time memory is written, shadow memory is
>> written a few instructions before. The danger is that if two threads are
>> racing on a memory word, you could get a thread switch in between the
>> shadow write and the real write, and then your shadow memory would not
>> match your real memory.
>
>Sidenote (perhaps there's a misunderstanding): I was thinking about 2
>threads running on 2 processors in parallel. There does not have to be any
>thread switching for races to occur, as without locking, interleaved memory
>accesses from the two threads/processors can happen in any order.

you would need to guarantee that whatever instructions the program uses as locking primitives are atomic... probably the only way to do that is to insist they come from valgrind's pthread library etc. anything else where valgrind shows misbehaviour is probably a bug in the program (ie not using locking...)

>As memory is a shared resource, we must use locking (atomic access to
>shadow and real memory). How fine-grained should the locking be? One lock
>for all memory accesses kills performance. Perhaps one lock for each
>allocated area, or one for every 64 bytes of memory? This would need very
>fast user-mode mutexes (e.g. futexes from Linux 2.5).
>But all this is about problems in some specific skin. I currently wonder
>more about problems with the generic Valgrind core (binary translation
>engine and runtime environment).
>
>> The problem would be avoided if we could guarantee that thread switches
>> only occur between basic blocks. Actually, now that I think about it,
>> that shouldn't be a problem with the current implementation since it does
>> thread scheduling itself, and never does thread switches in the middle of
>> a basic block. Hmm.
>>
>> As for getting rid of Valgrind's threads implementation, the best idea so
>> far is to intercept the clone() syscall, and have Valgrind schedule
>> threads itself but not do all the pthreads ops itself. This sounds
>
>But with the scenario of kernel-level threads, Valgrind can't schedule
>them ?!

you really should have a look at the stuff in my WINE support patches... or the 1.9.6-wine tarball. I have modified valgrind to support clone(), but only one (kernel-level) thread runs at once. what happens is the valgrind scheduler loop spins in each cloned thread, but only one is active at once (so there is no need for locking). signals are used to tell the cloned threads when to wake up. clone() (the library call) has been re-implemented to use a variant of valgrind's pthread_create() which guarantees a kernel thread is created.

I suspect that there will be so much locking overhead imposed by multiple threads running simultaneously that it is quicker to just do a 'global' lock. the real killer will be memory accesses, particularly threads' stacks. You could possibly do something with mprotect(), but that would require the threads to have separate memory maps (and then valgrind fakes up the 'shared' memory map they expect). ugh. multi-processor machines use a LOT of hardware to cope with these issues (cache snooping etc etc).

Seeya,
Adam

--
Real Programmers don't comment their code.
If it was hard to write, it should be hard to read, and even harder to modify.
These are all my own opinions.
From: Josef W. <Jos...@gm...> - 2003-06-02 14:07:37
On Monday 26 May 2003 12:05, Nicholas Nethercote wrote:
> On Mon, 26 May 2003, Josef Weidendorfer wrote:
> > currently I'm thinking a little bit about what would be needed to allow
> > applications running under Valgrind to use processors in parallel. The
> > main goal would be to speed up cache simulation for multithreaded
> > applications, more specifically to first let OpenMP apps (number
> > crunching) run simultaneously. I'm not at all convinced that there will
> > be any benefit/speedup at all on multiple processors because of a
> > possible need for additional fine-grained communication among the
> > threads.
> > [...]

Nick, Jeremy, Adam,

thanks a lot for the responses. As the main goal of my thinking was (for now) speeding up cache simulation, I trashed the idea of making V threads out of application threads, because of the synchronisation issues in the skins.

A better idea would be to separate event handling (e.g. memory accesses, trackable valgrind events, ...) into another process (or processes), forked off at valgrind startup. By using a ring buffer shared with the event-handler process, communication should be really fast, and the handler process can run in parallel on a 2-processor machine (or on a P4 with hyperthreading -- here, AFAIK, busy polling would be a no-no, but there are workarounds for this problem?).

I think this could be a second general "split" approach, almost orthogonal to the core/skin split: instead of calling an event handler in the skin, the event could be put into the ring buffer. We could even make the communication bidirectional by using a second ring buffer, and if an event handler has to run synchronously, block for an answer from the event-handler process. The best would be to add the ring-buffer communication in a way transparent to the event-handler functions, much like RPC. This way, we could make it a runtime switch whether to use the event-handler process or not.

Advantages:
* speedup on 2-processor machines / P4 with hyperthreading
* normal use of GDB for the event-handler process, which eases development of handlers in contrast to skin development (the valgrind process will block because the ring buffer is full)
* the event-handler process can be valgrinded (!)
* if the communication to the handler process is unidirectional, the events can be dumped directly to a file, and the event handler can run afterwards, using the stored events. Of course, this often will not be practical because of the huge amount of event data. But perhaps some compression could be done here.
* the event handler can be a GUI application (allowing e.g. for a real trace visualisation, not only a profile).

Fortunately, for cache simulation, no results have to be fed back from the cache simulation to the valgrind runtime. For cachegrind, almost all skin actions (e.g. BBCC allocation, cache simulation, trace dumping) could be separated into the event-handler process. I could imagine that even memcheck could be split this way for most events, as error reporting can be done asynchronously.

What do you think about this idea?

Josef
From: Adam G. <ar...@cy...> - 2003-06-02 14:28:28
At 16:12 02/06/2003 +0200, Josef Weidendorfer wrote:
>On Monday 26 May 2003 12:05, Nicholas Nethercote wrote:
>> On Mon, 26 May 2003, Josef Weidendorfer wrote:
>> > currently I'm thinking a little bit about what would be needed to allow
>> > applications running under Valgrind to use processors in parallel. The
>> > main goal would be to speed up cache simulation for multithreaded
>> > applications, more specifically to first let OpenMP apps (number
>> > crunching) run simultaneously. I'm not at all convinced that there will
>> > be any benefit/speedup at all on multiple processors because of a
>> > possible need for additional fine-grained communication among the
>> > threads.
>> > [...]
>
>Nick, Jeremy, Adam,
>
>thanks a lot for the responses. As the main goal of my thinking was (for
>now) speeding up cache simulation, I trashed the idea of making V threads
>out of application threads, because of the synchronisation issues in the
>skins.

<snip>

>Fortunately, for cache simulation, no results have to be fed back from the
>cache simulation to the valgrind runtime. For cachegrind, almost all skin
>actions (e.g. BBCC allocation, cache simulation, trace dumping) could be
>separated into the event-handler process.
>I could imagine that even memcheck could be split this way for most events,
>as error reporting can be done asynchronously.
>
>What do you think about this idea?

you would have to be _really_ efficient at marshalling/unmarshalling the events, otherwise the IPC overhead would use up all the performance gain from 2 CPUs. maybe futexes are the answer?

sounds good for cache simulation, since the events are one-way only - I doubt it would be useful for memcheck though - the number of error events (should be) tiny.

Seeya,
Adam
From: Josef W. <Jos...@gm...> - 2003-06-02 15:09:51
On Monday 02 June 2003 16:28, Adam Gundy wrote:
> At 16:12 02/06/2003 +0200, Josef Weidendorfer wrote:
> >Fortunately, for cache simulation, no results have to be fed back from
> > the cache simulation to the valgrind runtime. For cachegrind, almost
> > all skin actions (e.g. BBCC allocation, cache simulation, trace
> > dumping) could be separated into the event-handler process.
> >I could imagine that even memcheck could be split this way for most
> > events, as error reporting can be done asynchronously.
> >
> >What do you think about this idea?
>
> you would have to be _really_ efficient at marshalling/unmarshalling the
> events otherwise the IPC overhead would use up all the performance gain
> from 2 CPUs. maybe futexes are the answer?

Marshalling should almost be a NOP, as an event tag should be enough for the handler to know the event's argument types and how to extract them. For communication, busy polling without any lock should be fine on a 2-processor machine: update the ring buffer's write pointer on the sender side after putting the event into the buffer, and let the receiver poll on this pointer. For the 1-processor case, use a wait condition only to signal "buffer full". The buffer should be of a size such that it won't be filled up in one time slice (?).

> sounds good for cache simulation, since the events are one-way only - I
> doubt it would be useful for memcheck though - the number of error events
> (should be) tiny.

Couldn't all the shadow memory stuff be done in the event-handler process? Perhaps this split is even worth it only to allow for easier development/bug fixing/valgrinding of the memcheck event-handler functions themselves, and the normal case would be to run them in the V process again? (Perhaps this is nonsense, as I don't know enough about memcheck.)

But still, IMHO this would be one step towards the goal of "allow valgrinding Valgrind itself", even if you can only valgrind the event-handler side. Look at it as trying to put a skin into another process.

I know, I should come up with a patch ;-)
(this would involve a stub skin, and a stub Valgrind core loading the real skin, to forward the events.)

Cheers,
Josef
From: Adam G. <ar...@cy...> - 2003-06-02 15:36:59
At 17:15 02/06/2003 +0200, Josef Weidendorfer wrote:
>On Monday 02 June 2003 16:28, Adam Gundy wrote:
>> At 16:12 02/06/2003 +0200, Josef Weidendorfer wrote:
>> >Fortunately, for cache simulation, no results have to be fed back from
>> >the cache simulation to the valgrind runtime. For cachegrind, almost
>> >all skin actions (e.g. BBCC allocation, cache simulation, trace
>> >dumping) could be separated into the event-handler process.
>> >I could imagine that even memcheck could be split this way for most
>> >events, as error reporting can be done asynchronously.
>> >
>> >What do you think about this idea?
>>
>> you would have to be _really_ efficient at marshalling/unmarshalling the
>> events otherwise the IPC overhead would use up all the performance gain
>> from 2 CPUs. maybe futexes are the answer?
>
>Marshalling should almost be a NOP, as an event tag should be enough for
>the handler to know the event's argument types and how to extract them.
>For communication, busy polling without any lock should be fine on a
>2-processor machine: update the ring buffer's write pointer on the sender
>side after putting the event into the buffer, and let the receiver poll on
>this pointer. For the 1-processor case, use a wait condition only to signal
>"buffer full".
>The buffer should be of a size such that it won't be filled up in one time
>slice (?).

busy polling sounds like a bad idea... even a really short usleep() should help

>> sounds good for cache simulation, since the events are one-way only - I
>> doubt it would be useful for memcheck though - the number of error events
>> (should be) tiny.
>
>Couldn't all the shadow memory stuff be done in the event-handler process?
>Perhaps this split is even worth it only to allow for easier development/
>bug fixing/valgrinding of the memcheck event-handler functions themselves,
>and the normal case would be to run them in the V process again?
>(Perhaps this is nonsense, as I don't know enough about memcheck.)

writes but not reads... memcheck is constantly checking the shadow memory state as well as updating it.

>But still, IMHO this would be one step towards the goal of "allow
>valgrinding Valgrind itself", even if you can only valgrind the
>event-handler side. Look at it as trying to put a skin into another
>process.

yes, I agree that it would at least enable this. if you do write a generic 'proxy' skin, it should definitely be configurable.

>I know, I should come up with a patch ;-)
>(this would involve a stub skin, and a stub Valgrind core loading the real
>skin, to forward the events.)

it occurs to me there is no need to limit it to 2 CPUs: if you have a four-CPU machine you can spawn three 'worker' processes (with a ring buffer each). If the events posted to the ring buffers (round robin unless one is full) have timestamps (just an incrementing integer), then the workers can sort things out later...

Seeya,
Adam
From: Josef W. <Jos...@gm...> - 2003-06-02 18:14:25
On Monday 02 June 2003 17:37, Adam Gundy wrote:
> At 17:15 02/06/2003 +0200, Josef Weidendorfer wrote:
> > [...]
> >Marshalling should almost be a NOP, as an event tag should be enough for
> > the handler to know the event's argument types and how to extract them.
> >For communication, busy polling without any lock should be fine on a
> >2-processor machine: update the ring buffer's write pointer on the
> > sender side after putting the event into the buffer, and let the
> > receiver poll on this pointer. For the 1-processor case, use a wait
> > condition only to signal "buffer full".
> >The buffer should be of a size such that it won't be filled up in one
> > time slice (?).
>
> busy polling sounds like a bad idea... even a really short usleep()
> should help

Hmmm... OK. IMHO busy polling is the fastest (to clear the buffer again), but it only makes sense if you have a 2nd dedicated CPU, and as this will almost never be the case...

So what about a buffer split in two parts, with a wait condition for each part, meaning "buffer part full"? So if half of the ring buffer gets full, the handler process is notified. And add a timeout in the handler to check regularly for data (for when little event data is being produced).

> >> sounds good for cache simulation, since the events are one-way only -
> >> I doubt it would be useful for memcheck though - the number of error
> >> events (should be) tiny.
> >
> >Couldn't all the shadow memory stuff be done in the event-handler
> > process? Perhaps this split is even worth it only to allow for easier
> > development/bug fixing/valgrinding of the memcheck event-handler
> > functions themselves, and the normal case would be to run them in the V
> > process again?
> >(Perhaps this is nonsense, as I don't know enough about memcheck.)
>
> writes but not reads... memcheck is constantly checking the shadow memory
> state as well as updating it.

Yup. On reads, an error to be reported could happen. But why does the Valgrind core have to stop and wait to see if a read will give you an error? The error should be printed by the handler. Perhaps a problem would be that the handler has to track the stack frames for the backtrace in the error message. OK, you can't attach a debugger when the error happens :-(

> >But still, IMHO this would be one step towards the goal of "allow
> >valgrinding Valgrind itself", even if you can only valgrind the
> > event-handler side. Look at it as trying to put a skin into another
> > process.
>
> yes, I agree that it would at least enable this. if you do write a
> generic 'proxy' skin, it should definitely be configurable.

For a general proxy, even instrumentation would have to be done by the handler side, and transferring UCode around for each basic block would be awfully slow (aside from the 2-processor case with busy polling?). OTOH, at development time of a skin this feature sure would be worth it.

Better to let the skin do the instrumentation and generate custom events to allow the handler process to set up structures depending on this instrumentation. Which means this can't be done transparently for skins...

> >I know, I should come up with a patch ;-)
> >(this would involve a stub skin, and a stub Valgrind core loading the
> > real skin, to forward the events.)
>
> it occurs to me there is no need to limit it to 2 CPUs: if you have a
> four-CPU machine you can spawn three 'worker' processes (with a ring
> buffer each). If the events posted to the ring buffers (round robin
> unless one is full) have timestamps (just an incrementing integer), then
> the workers can sort things out later...

The problem is that most of the time, the data produced/updated in the event handlers depends on earlier events: e.g. the state of the cache after one access decides whether the next access will be a miss -- even if we had multiple caches to simulate, because of cache coherence protocols. That's the original problem of this thread.

A solution would be a pipeline of workers, splitting up the work for one event: e.g. let the first simulate the 1st-level cache, the next the 2nd-level cache, and the last do the structure updating.

Josef
From: Adam G. <ar...@cy...> - 2003-06-03 09:13:47
At 20:19 02/06/2003 +0200, Josef Weidendorfer wrote:
>On Monday 02 June 2003 17:37, Adam Gundy wrote:
>> At 17:15 02/06/2003 +0200, Josef Weidendorfer wrote:
>> > [...]
>> >Marshalling should almost be a NOP, as an event tag should be enough
>> > for the handler to know the event's argument types and how to extract
>> > them.
>> >For communication, busy polling without any lock should be fine on a
>> >2-processor machine: update the ring buffer's write pointer on the
>> > sender side after putting the event into the buffer, and let the
>> > receiver poll on this pointer. For the 1-processor case, use a wait
>> > condition only to signal "buffer full".
>> >The buffer should be of a size such that it won't be filled up in one
>> > time slice (?).
>>
>> busy polling sounds like a bad idea... even a really short usleep()
>> should help
>
>Hmmm... OK.
>IMHO busy polling is the fastest (to clear the buffer again), but it only
>makes sense if you have a 2nd dedicated CPU, and as this will almost never
>be the case...
>
>So what about a buffer split in two parts, with a wait condition for each
>part, meaning "buffer part full"? So if half of the ring buffer gets full,
>the handler process is notified. And add a timeout in the handler to check
>regularly for data (for when little event data is being produced).

sounds like a serial port FIFO.... (eg a 16550, where you have a 16-byte buffer and can set the level at which an interrupt is generated. depending on how quickly you can clear the buffer you set it to 8, 12 etc, so there is still some buffer space to be filled while the interrupt is being serviced).

a direct analogy of this would be eg a 16Mb buffer, where valgrind sends a real-time 'empty the buffer' signal to the worker process when it is 8Mb full etc. the idea is that valgrind would automatically adjust to the speed of signal delivery etc by altering the point at which it decides a signal needs to be sent.
>> >> sounds good for cache simulation, since the events are one-way only
>> >> - I doubt it would be useful for memcheck though - the number of
>> >> error events (should be) tiny.
>> >
>> >Couldn't all the shadow memory stuff be done in the event-handler
>> > process? Perhaps this split is even worth it only to allow for easier
>> > development/bug fixing/valgrinding of the memcheck event-handler
>> > functions themselves, and the normal case would be to run them in the
>> > V process again?
>> >(Perhaps this is nonsense, as I don't know enough about memcheck.)
>>
>> writes but not reads... memcheck is constantly checking the shadow
>> memory state as well as updating it.
>
>Yup.
>On reads, an error to be reported could happen. But why does the Valgrind
>core have to stop and wait to see if a read will give you an error? The
>error should be printed by the handler. Perhaps a problem would be that
>the handler has to track the stack frames for the backtrace in the error
>message.
>OK, you can't attach a debugger when the error happens :-(

you would have to send a stack trace with every memory test - this would be horribly slow. hmmm. I guess delta compression of the stack trace would help, since typically only the top frame changes... you would also need some sort of optimized stack trace maintained for each thread as well - track calls and returns, so that most of the time you only need to look at EIP to get the current stack trace. there are some odd cases where RET is not used, which would cause pain. it might work...

>> >But still, IMHO this would be one step towards the goal of "allow
>> >valgrinding Valgrind itself", even if you can only valgrind the
>> > event-handler side. Look at it as trying to put a skin into another
>> > process.
>>
>> yes, I agree that it would at least enable this. if you do write a
>> generic 'proxy' skin, it should definitely be configurable.
>
>For a general proxy, even instrumentation would have to be done by the
>handler side, and transferring UCode around for each basic block would be
>awfully slow (aside from the 2-processor case with busy polling?). OTOH,
>at development time of a skin this feature sure would be worth it.
>
>Better to let the skin do the instrumentation and generate custom events
>to allow the handler process to set up structures depending on this
>instrumentation. Which means this can't be done transparently for skins...

maybe you need to write two kinds of proxy ;-)

Seeya,
Adam
|
From: Josef W. <Jos...@gm...> - 2003-06-03 10:06:18
|
On Tuesday 03 June 2003 11:10, Adam Gundy wrote:
> At 20:19 02/06/2003 +0200, Josef Weidendorfer wrote:
> >So what about a buffer, split in two parts with a wait condition for
> > each part, meaning "buffer part full". So if half of the ring buffer
> > gets full, the handler process is notified. And make a timeout in the
> > handler to check regularly for data (when little event data will be
> > produced).
>
> sounds like a serial port fifo.... (eg a 16550 where you have a 16 byte
> buffer, and you can set the level at which an interrupt is generated.
> depending on how quickly you can clear the buffer you set it to 8, 12 etc,
> so there is still some buffer space to be filled while the interrupt is
> being serviced).
>
> a direct analogy of this would be eg a 16Mb buffer, then valgrind sends a
> real-time 'empty the buffer' signal to the worker process when it is 8Mb
> full etc. the idea is that valgrind would automatically adjust to the
> speed of signal delivery etc by altering the point at which it decides a
> signal needs to be sent.

Good idea.

> >> >> sounds good for cache simulation, since the events are one way only
> >> >> - I doubt it would be useful for memcheck though - the number of
> >> >> error events (should be) tiny.
> >> >
> >> >Couldn't all the shadow memory stuff be done in the event handler
> >> > process? Perhaps this split is even worth it only to allow for easier
> >> > development/bug fixing/valgrinding of the memcheck event handler
> >> > functions themselves, and the normal case would be to run them in the
> >> > V process again?
> >> >(Perhaps this is nonsense, as I don't know enough about memcheck.)
> >>
> >> writes but not reads... memcheck is constantly checking the shadow
> >> memory state as well as updating it.
> >
> >Yup.
> >On reads, an error to be reported could happen. But why does the Valgrind
> > core have to stop and wait to see if a read will give you an error? The
> > error should be printed by the handler. Perhaps a problem would be that
> > the handler has to track the stack frames for the backtrace in the error
> > message.
> >Ok, you can't attach a debugger when the error happens :-(
>
> you would have to send a stack trace with every memory test - this would
> be horribly slow. hmmm. I guess delta compression of the stack trace
> would help, since typically only the top frame changes... you would also
> need some sort of optimized stack trace maintained for each thread as
> well - track calls and returns, so that most of the time you only need to
> look at EIP to get the current stack trace. there are some odd cases
> where RET is not used which would cause pain.

Luckily, this work is already done. In my calltree skin, I track the call
chain on my own for each thread, using CALL/RET/JUMP events with the
content of %esp. It handles situations like longjmp and exception handling
quite fine.

Josef
|
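The %esp trick Josef describes might look something like the following (a hypothetical sketch, not the actual calltree code; all names are invented). Because the x86 stack grows downward, popping every frame whose saved stack pointer lies below the current one unwinds any number of frames at once, which is why longjmp and exception propagation fall out naturally:

```c
#include <stddef.h>

#define MAX_DEPTH 256

typedef unsigned long Addr;

/* One shadow call-stack entry per thread: the call target plus the
   stack pointer observed at the time of the CALL. */
typedef struct {
    Addr func;   /* entered function          */
    Addr sp;     /* %esp observed at the CALL */
} Frame;

typedef struct {
    Frame  frames[MAX_DEPTH];
    size_t depth;
} CallChain;

/* On a CALL event: push the new frame. */
void chain_call(CallChain *c, Addr func, Addr sp)
{
    if (c->depth < MAX_DEPTH) {
        c->frames[c->depth].func = func;
        c->frames[c->depth].sp   = sp;
        c->depth++;
    }
}

/* On a RET or suspicious JUMP event: pop every frame whose saved %esp
   is below the current %esp.  A normal RET pops exactly one frame; a
   longjmp or exception unwind pops several in one step. */
void chain_unwind(CallChain *c, Addr sp)
{
    while (c->depth > 0 && c->frames[c->depth - 1].sp < sp)
        c->depth--;
}
```

With such a chain maintained per thread, producing a backtrace for an error event is just a walk over `frames[0..depth)`, with only EIP needed for the innermost entry.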
|
From: Jeremy F. <je...@go...> - 2003-05-26 16:28:01
|
On Mon, 2003-05-26 at 02:42, Josef Weidendorfer wrote:
> currently I'm thinking a little bit of what would be needed to allow
> applications run under Valgrind to use processors in parallel. The main goal
> would be to speed up cache simulation for multithreaded applications, more
> specifically first to let OpenMP apps (number crunching) run simultaneously.
> I'm not at all convinced if there will be any benefit/speedup at all on
> multiple processors because of a possible need for additional fine-grained
> communication among the threads.
I've been thinking about this too. I think the real difficulty is in
the skins rather than the core. The core only has occasional relatively
large chunks of work to do (ie, translate a basic block). It would be
fairly easy to make a translation hash miss do the right thing (you'd
probably do it locklessly, so that the hash update is an atomic
operation; if two threads happen to want the same basic block at the
same time, they'd both translate it, but one would win and the other
would be thrown away).
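A lockless publish of that kind could look roughly like this (using C11 atomics purely for the sketch; the table layout, names and sizes are invented, and hash collisions/open addressing are elided):

```c
#include <stdatomic.h>
#include <stddef.h>

#define TT_SIZE 4096   /* hypothetical table size, power of two */

typedef struct {
    unsigned long orig_addr;   /* guest address of the basic block */
    void         *trans_code;  /* host code produced by the JIT    */
} TTEntry;

/* One atomic pointer slot per hash bucket. */
static _Atomic(TTEntry *) tt[TT_SIZE];

/* Publish a freshly made translation.  If another thread raced us and
   published first, we lose: our duplicate work is discarded and we use
   the winner's entry.  Returns whichever entry ended up in the table. */
TTEntry *tt_publish(unsigned long addr, TTEntry *mine,
                    void (*discard)(TTEntry *))
{
    size_t   slot     = (addr >> 2) & (TT_SIZE - 1);
    TTEntry *expected = NULL;
    if (atomic_compare_exchange_strong(&tt[slot], &expected, mine))
        return mine;               /* we won the race */
    if (discard)
        discard(mine);             /* loser: throw our translation away */
    return expected;               /* the winner's translation */
}
```

The worst case is that two threads translate the same block once each; there is no lock to contend on, and the occasional wasted translation is cheap compared to the fine-grained synchronisation the skins would need.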
The real problem is that the skins want to do data structure updates on
an instruction by instruction level. Nick mentioned memcheck; helgrind
is an even more extreme example, since it actually cares a lot about the
program's precise thread and lock behaviour, and what threads touch
which memory in what order.
The only reasonable way I can see to implement it would be to generate
inline atomic operations, rather than using mutexes. Unfortunately I
think that would still have an extreme amount of overhead; probably
enough to overwhelm any possible performance benefit of multiple CPUs.
There would also be some memory overhead, since you'd have to include
space for locks in the data, though you could choose the density as a
tradeoff between memory use and concurrency.
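As a toy model of the problem (memcheck-style V-bits, where 0xFF means undefined; all names and sizes here are invented): even if each half of the update is individually atomic, a thread switch between the shadow store and the real store can still leave the two out of sync, which is exactly the killer point Nick raised:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_SIZE (1u << 16)   /* toy address space for illustration */

static _Atomic uint8_t shadow[MEM_SIZE];   /* 0xFF = undefined */
static uint8_t         real_mem[MEM_SIZE];

/* Everything starts out undefined. */
void shadow_init(void)
{
    for (size_t i = 0; i < MEM_SIZE; i++)
        atomic_store(&shadow[i], 0xFF);
}

/* A checked write: the shadow update happens a few "instructions"
   before the real store.  The atomic RMW keeps two racing writers
   from losing each other's V-bits, but a thread switch between the
   two lines below can still leave shadow and real memory mismatched
   unless switches are confined to basic-block boundaries. */
void checked_store(size_t addr, uint8_t val)
{
    atomic_store(&shadow[addr], 0x00);   /* mark byte as fully defined */
    real_mem[addr] = val;
}

/* A checked read: fetch the V-bits before touching the data.
   Returns 1 if the value is fully defined. */
int checked_load(size_t addr, uint8_t *out)
{
    uint8_t vbits = atomic_load(&shadow[addr]);
    *out = real_mem[addr];
    return vbits == 0x00;
}
```

The inline-atomics approach makes each line above safe against concurrent updates to the *same* location, but it cannot by itself make the shadow-then-real pair appear as one step.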
A much more complex, but perhaps efficient way to do it, would be to
make all skins which care about keeping per-byte memory metadata behave
more like helgrind. That is, have the skin classify heap memory as
being "per-thread" or "shared". Per-thread memory+metadata could be
handled without any locking. As soon as another thread touches it, you
would convert it to "shared", which requires locked access. Handling
this transition would be tricky, as would handling the codegen issues
(would you generate all memory accesses as if they could be shared, or
would you regenerate those memory accesses which turn out to be
shared?). The problem with this approach, like so many other possible
"optimisations" for Valgrind, is that the overhead of all the
bookkeeping could easily remove any benefit.
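The per-thread/shared classification could be sketched as a tiny state machine (hypothetical names; note that in real code the promotion itself would have to be atomic, which is part of the trickiness mentioned above):

```c
/* Per-chunk ownership state, helgrind-style: memory starts out owned
   by the thread that allocated it and only pays for locking once a
   second thread touches it. */
typedef enum { ST_EXCLUSIVE, ST_SHARED } OwnState;

typedef struct {
    OwnState state;
    int      owner_tid;   /* valid only while ST_EXCLUSIVE */
} ChunkMeta;

/* Classify one access; returns 1 if it must take the (slow) locked
   path.  Once a chunk goes shared it never goes back. */
int chunk_on_access(ChunkMeta *m, int tid)
{
    if (m->state == ST_EXCLUSIVE) {
        if (m->owner_tid == tid)
            return 0;             /* fast path: no locking needed */
        m->state = ST_SHARED;     /* second thread touched it: promote */
    }
    return 1;                     /* shared: locked access from now on */
}
```

The codegen question then becomes whether every memory access carries this check inline, or whether translations touching promoted chunks get regenerated.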
Of course, skins which don't keep per-byte metadata about memory
(cachegrind, vgprof, etc) can just keep everything per-thread and
reconcile at the end.
The other killer is that it would make writing Valgrind itself and skins
a lot more complex. Valgrind is hard enough to get right as it is;
adding concurrency would simply make the tool itself a lot less
trustworthy.
> * Signal handling?
Signal handling+threads = <shudder>
> * What's with Valgrinds version of the pthread library? Do you think that it's
> a big task to make this reentrant-safe? Or perhaps we even could get rid of
> our own implementation?
I think if we do go down this path, then we can take a step back. Rather
than emulating threading at the pthread level, we can emulate it at the
clone system call level, and therefore allow any user-space pthreads
implementation. We may still want to intercept pthreads library calls
so that we have a better idea of what the program is actually trying to
achieve.
J
|