|
From: Josef W. <Jos...@gm...> - 2007-03-27 14:07:28
|
Hi,

recently I was playing with the idea of speeding up cache simulation by using both cores of these (not that new) dual-core processors. The idea is to run the cache simulation in another thread (*1*), fed by memory access events which Cachegrind's instrumentation, generated as usual, writes into a buffer.

In a first step, separating the event producer and the consumer via a simple buffer works quite nicely, still in sequential mode; the patch is even quite minimal, as I have a 1:1 relation between log_* simulation calls and event types, producing the same result. Cachegrind's instrumentation does not call the simulation routines directly, but writes the data into a buffer. At the beginning of an SB, it is checked whether there is enough space available for all events which could be produced by this SB. If the buffer is full, a helper is run which does the simulation for all events sitting in the buffer, clearing it afterwards. The nice thing is that this helper call is the only C call needed in the instrumentation, and it is guarded by the "full" condition (*2*).

Using a buffer of 1 KB, I thought this scheme should run at roughly the same speed as original Cachegrind. However, quite surprisingly, the small event dispatcher loop in the helper takes around 30% of the time with some cache-friendly client code (I used "rpm -q <package>"). The measurement was done with OProfile, and should be quite accurate.

It looks like branch mispredictions are responsible for this; I checked this against measurements with the corresponding hardware performance counters. My example did around 30 million events per second, so a branch misprediction at every event can explain the 30% figure. Cachegrind's instrumentation itself never sees a misprediction, as the generated instrumentation knows which simulation functions to call. However, when dispatching events from the buffer, there is no way to predict the next event.
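The buffering scheme described above might look roughly like this minimal, sequential C sketch (all names, types and event encodings are hypothetical — this is not Cachegrind's actual code). The switch in flush_buffer() is the dispatch loop whose branch mispredicts on nearly every event:

```c
/* Hypothetical model of the producer/consumer scheme, sequential mode. */

typedef enum { EV_INSTR, EV_DATA_READ, EV_DATA_WRITE } EventTag;

typedef struct { EventTag tag; unsigned long addr; int size; } Event;

#define BUF_EVENTS 128           /* small fixed-size buffer, cf. ~1 KB above */
static Event buf[BUF_EVENTS];
static int buf_used = 0;

static long n_instr, n_read, n_write;   /* stand-ins for the simulation state */

/* The consumer: one switch per event.  This indirect branch has no
   exploitable pattern, so it mispredicts almost every iteration. */
static void flush_buffer(void)
{
   for (int i = 0; i < buf_used; i++) {
      switch (buf[i].tag) {
         case EV_INSTR:      n_instr++; break;   /* would call log_1I()  */
         case EV_DATA_READ:  n_read++;  break;   /* would call log_1Dr() */
         case EV_DATA_WRITE: n_write++; break;   /* would call log_1Dw() */
      }
   }
   buf_used = 0;
}

/* The producer, conceptually called from the instrumented code.  The
   "full" check guards the only C call on the fast path. */
static void emit_event(EventTag tag, unsigned long addr, int size)
{
   if (buf_used == BUF_EVENTS)     /* the guarded helper call */
      flush_buffer();
   buf[buf_used++] = (Event){ tag, addr, size };
}
```

In the real setup the guard check would of course be inlined into the generated instrumentation rather than being a C function.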
Currently I am searching for a way to transfer the knowledge of the event producer (Cachegrind's instrumentation) to the consumer (the dispatch loop) to get fewer mispredictions there. Perhaps someone has better ideas?

Idea (1): Coding multiple events into bigger ones would reduce the event count, and thus the number of mispredictions. However, the number of dispatch cases explodes, which makes the code bigger.

Idea (2): Use VEX to generate code pieces for the consumer, one matching each client SB. The generated code knows exactly which memory access events happen in a given client SB. This would totally get rid of mispredictions, and has the nice side effect of reducing the amount of data which has to be sent to the event consumer, as all fixed values (instruction addresses, sizes, data sizes) are already incorporated into the generated code. However, I have no idea how to make this work. Generating a further IRSB while instrumenting a client SB is easy, but how do I run VEX from there to generate real code? The generated code has to be put into the translation cache, but the tool should have some way to control its life cycle. At the least, there has to be some hook for the tool before such code is discarded by Valgrind. Or would it be useful to have a tool-controlled separate translation cache?

Perhaps this idea is crazy, and in the end not worthwhile. For the envisioned parallelization to be useful, the communication overhead to the other core needs to be really low, which means minimizing MESI transactions.

Josef

(*1*) Actually, I started adding a new VG_(createThread) to coregrind, which uses clone() to spawn a helper thread for a tool. However, I do not really know how to do this; it works somehow, but not really well: I get an assertion failure when pressing Ctrl-C in Valgrind. Should there be signal handlers for such a helper thread which simply pass any signals back to VG?
(*2*) One question about guards for dirty helper calls: Is it correct that using a guard together with a return value is useless? As a temporary register can only be written once (the SSA constraint), what is the value of the temporary register for the return value if no call happens? It would be useful to provide an IRExpr to assign to the temporary when the call is not done.
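Idea (1) above — fusing frequent event sequences into compound events — could be sketched like this (purely hypothetical names, not Cachegrind code). One fused tag pays a single dispatch branch for two events, at the cost of an extra case per fused pattern:

```c
/* Hypothetical sketch of idea (1): fuse common event pairs into compound
   tags, so the dispatch loop takes one (mispredictable) branch per pair
   instead of one per event.  The case count grows with each fused pattern. */

typedef enum {
   EV_INSTR,            /* one instruction fetch                     */
   EV_READ,             /* one data read                             */
   EV_INSTR_THEN_READ   /* fused: fetch immediately followed by read */
} EventTag;

typedef struct { EventTag tag; unsigned long a1, a2; } Event;  /* a1/a2: addresses */

static long n_instr, n_read;

static void dispatch(const Event* ev, int n)
{
   for (int i = 0; i < n; i++) {
      switch (ev[i].tag) {
         case EV_INSTR:           n_instr++;           break;
         case EV_READ:            n_read++;            break;
         case EV_INSTR_THEN_READ: n_instr++; n_read++; break;  /* one branch, two events */
      }
   }
}
```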
|
From: Julian S. <js...@ac...> - 2007-03-27 15:05:17
|
> It looks like branch mispredictions are responsible for this, and
> I checked this against measurement with according hardware performance
> counters. My example did around 30 million events per second, so a
> branch misprediction at every event can explain this 30% figure.
Maybe you need to give the branch predictors more code to correlate the
dispatcher loop branch(es) against. One possibility is to unroll by hand
the dispatcher loop a few times, so that each switch can be correlated
to some extent with branches in earlier switches:
before:
   for (i = 0; i < nEvents; i++) {
      switch (event[i]) {
         ...
      }
   }
-->
   for (i = 0; i < nEvents; i += 4) {
      switch (event[i]) {
         ...
      }
      switch (event[i+1]) {
         ...
      }
      switch (event[i+2]) {
         ...
      }
      switch (event[i+3]) {
         ...
      }
   }
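A compilable version of the unrolling sketched above (with a hypothetical event encoding) also needs a tail loop for event counts that are not a multiple of 4. The macro duplicates the switch textually, so each of the four sites gets its own branch address for the predictor to correlate:

```c
/* Hypothetical event encoding; the handlers are stand-ins. */
enum { EV_A, EV_B, EV_C };
static long count[3];

/* Textual duplication: each HANDLE expansion is a distinct switch,
   hence a distinct indirect-branch site in the binary. */
#define HANDLE(e)                                        \
   switch (e) {                                          \
      case EV_A: count[EV_A]++; break;  /* e.g. fetch */ \
      case EV_B: count[EV_B]++; break;  /* e.g. read  */ \
      case EV_C: count[EV_C]++; break;  /* e.g. write */ \
   }

static void dispatch_unrolled(const int* ev, int n)
{
   int i = 0;
   for (; i + 4 <= n; i += 4) {   /* four branch sites per iteration */
      HANDLE(ev[i]);
      HANDLE(ev[i+1]);
      HANDLE(ev[i+2]);
      HANDLE(ev[i+3]);
   }
   for (; i < n; i++)             /* tail: n need not be a multiple of 4 */
      HANDLE(ev[i]);
}
```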
I would add only that I have tried games like this before (in m_dispatch/*.S)
and although it is possible sometimes to get large wins, I found it difficult
to do that consistently. I found it to be like chasing ghosts.
> (*2*) One question about guards for dirty helper calls: Is it correct
> that using a guard together with a return value is useless?
> As a temporary register can be only written once (SSA constrain), what
> is the value of the temporary register for the return value if there
> is no call happening? It would be useful to provide a IRExpr to
> assign to the temporary when the call was not done.
Looks like a genuine semantic hole in the IR, afaics. (!)
I have no good answer. I never thought of this before.
I looked at the x86 backend to see what it would do in that situation. It
generates code to do the call (or not), and then copies %eax into the register
assigned to the temporary, regardless of whether the call happened.
Which means it will be garbage if the call did not happen.
(VEX/priv/host-x86/isel.c:3566)
Maybe a dirty helper should specify a default value for the temporary if
the call is not taken. That would make it safe. If the backend can prove
the call is always taken then it can omit computation/assignment of the
default and so avoid performance loss in the common cases. Hmm, this is
all a bit messy. Need to think about it more.
J
|
|
From: Julian S. <js...@ac...> - 2007-03-27 15:12:55
|
> It looks like branch mispredictions are responsible for this, and
> I checked this against measurement with according hardware performance
> counters. My example did around 30 million events per second, so a
> branch misprediction at every event can explain this 30% figure.

If I understand your problem correctly, what you describe is the same
problem that people have when writing bytecode interpreters. This is a
quite-well-studied problem. See
http://www.csc.uvic.ca/~csc586a/slides/BranchPredict.pdf
for a good discussion and suggestions.

J
|
From: Josef W. <Jos...@gm...> - 2007-03-27 15:40:50
|
On Tuesday 27 March 2007, Julian Seward wrote:
> > It looks like branch mispredictions are responsible for this, and
> > I checked this against measurement with according hardware performance
> > counters. My example did around 30 million events per second, so a
> > branch misprediction at every event can explain this 30% figure.
>
> If I understand your problem correctly, what you describe is the same
> problem that people have when writing bytecode interpreters.

Hmm.. Yes, should be quite similar.

> This is a quite-well-studied problem. See
> http://www.csc.uvic.ca/~csc586a/slides/BranchPredict.pdf
> for a good discussion and suggestions.

Thanks for the pointer!

Josef
|
From: Nicholas N. <nj...@cs...> - 2007-03-27 21:53:06
|
On Tue, 27 Mar 2007, Josef Weidendorfer wrote:

> recently I was playing with the idea to speed up cache simulation
> by using 2 cores in these (not that new) dual-core processors.
>
> The idea is to run the cache simulation in another thread (*1*),
> feeded by memory access events written into a buffer by
> cachegrind generated as usual.

There's a paper "SuperPin: Parallelizing Dynamic Instrumentation for
Real-Time Performance" by Wallace and Hazelwood that discusses this topic.

> In a first step, separating the event producer and the consumer via
> a simple buffer works quite nicely, still in sequential mode;
> the patch even is quite minimal, as I have a 1:1 relation between
> log_* simulation calls and event types, producing the same result.
> Cachegrinds instrumentation does not call the simulation routines
> directly, but writes the data into a buffer. At beginning of a SB,
> it is checked if there is enough space available for all events
> which could be produced by this SB.
> If the buffer is full, a helper is run which does simulation for
> all events sitting in the buffer, clearing it afterwards. The nice
> thing is that this helper call is the only C call needed in the
> instrumentation, and it is guarded by the "full" condition (*2*).
>
> Using a buffer of 1 KB, I thought that this scheme should run
> roughly at the seem speed as original cachegrind.
>
> However, quite surprisingly, the small event dispatcher loop in the
> helper takes around 30% time with some cache friendly client code
> (I used "rpm -q <package>"). The measurement was done with OProfile,
> and should be quite fine.
>
> It looks like branch mispredictions are responsible for this, and
> I checked this against measurement with according hardware performance
> counters. My example did around 30 million events per second, so a
> branch misprediction at every event can explain this 30% figure.
>
> Cachegrinds instrumentation itself never sees a misprediction, as the
> generated instrumentation knows which simulation functions to call.
> However, when dispatching events from the buffer, there is no
> possibility to predict the next event.

So what was the overall slow-down of this producer/consumer version vs
original? I tried the same thing a few years ago, and found that the
number of instructions was greatly reduced (by roughly 25% IIRC) but that
the increase in branch mispredictions neutralized that and speed was
similar to the original, and the producer/consumer version was more
complex.

> Currently I am searching for a way to transfer the knowledge
> of the event producer (cachegrinds instrumentation) to the consumer
> (the dispatch loop) to get less mispredictions there. Perhaps someone
> has better ideas?
>
> Idea (1):
> Coding multiple events into bigger ones would reduce the event count,
> and thus, number of mispredictions. However, the number of dispatch
> cases explodes, and makes the code bigger.
>
> Idea (2):
> Use VEX to generate code pieces for the consumer which match one client
> SB each. The generated code has exactly the knowledge, which memory
> access events happen in a given client SB.
> This totally would get rid of mispredictions, and has the nice effect
> to reduce the number of data which has to be sent to the event consumer,
> as all fixed values (instruction addresses, sizes, data sizes) are
> already incorporated in the generated code.
>
> However, I have no idea how to make this work. Generating a further
> IRSB while instrumenting a client SB is easy, but how to run VEX from
> there to generate real code? The generated code has to be put into
> the translation cache, but the tool should have some way to control
> its live cycle. At least, there has to be some hook for the tool
> before such a code would be discarded by Valgrind.
> Or would it be useful to have a tool-controlled separate translation
> cache?
>
> Perhaps this idea is crazy, and in the end not worthwhile.
> For the visioned parallelization to be useful, the communication
> overhead to the other core needs to be really low, which
> means to minimize MESI transactions.

It, like a lot of parallel programming tasks, sounds difficult and
fragile.

N
|
From: Nicholas N. <nj...@cs...> - 2007-03-27 21:57:03
|
On Tue, 27 Mar 2007, Julian Seward wrote:

>> (*2*) One question about guards for dirty helper calls: Is it correct
>> that using a guard together with a return value is useless?
>> As a temporary register can be only written once (SSA constrain), what
>> is the value of the temporary register for the return value if there
>> is no call happening? It would be useful to provide a IRExpr to
>> assign to the temporary when the call was not done.
>
> Looks a genuine semantic hole in IR afaics. (!)
>
> I have no good answer. I never thought of this before.
>
> I looked at the x86 backend to see what it would do in that situation.
> It generates code to do the call (or not), and then copies %eax into the
> register assigned to the temporary, regardless of whether the call
> happened. Which means it will be garbage if the call did not happen.
> (VEX/priv/host-x86/isel.c:3566)
>
> Maybe a dirty helper should specify a default value for the temporary if
> the call is not taken. That would make it safe. If the backend can prove
> the call is always taken then it can omit computation/assignment of the
> default and so avoid performance loss in the common cases. Hmm, this is
> all a bit messy. Need to think about it more.

Or you could just disallow return values in conditional calls? That would
be fine with all the current uses, and with Josef's proposed usage too, if
I understand correctly.

More generally, how do you represent a conditional move in SSA? I don't
think you can; conditional moves don't really make sense when you don't
have specific locations (eg. registers) for holding values.

N
|
From: Julian S. <js...@ac...> - 2007-03-27 22:50:07
|
> Or you could just disallow return values in conditional calls?

Yes, that sounds simpler. Good.

> More generally, how do you represent a conditional move in SSA?

Maybe with a ternary ?: style operator? In Vex it's done using Mux0X.

J
|
From: Josef W. <Jos...@gm...> - 2007-03-28 17:07:07
|
On Tuesday 27 March 2007, Nicholas Nethercote wrote:
> > Maybe a dirty helper should specify a default value for the temporary
> > if the call is not taken. That would make it safe. If the backend can
> > prove the call is always taken then it can omit computation/assignment
> > of the default and so avoid performance loss in the common cases. Hmm,
> > this is all a bit messy. Need to think about it more.
>
> Or you could just disallow return values in conditional calls? That
> would be fine with all the current uses, and with Josef's proposed usage
> too, if I understand correctly.

My instrumentation loads the buffer's write pointer into a temporary reg
at the beginning of an SB. This has to be done after the helper call, as
that call can change the write pointer. However, I already need the write
pointer before that, for the calculation of the guard boolean, and the
second load should not be needed. So a default return value when no
helper call is done would indeed be nice.

Josef

> More generally, how do you represent a conditional move in SSA? I don't
> think you can, conditional moves don't really make sense when you don't
> have specific locations (eg. registers) for holding values.
|
From: Julian S. <js...@ac...> - 2007-03-27 22:55:23
|
Josef

> > If I understand your problem correctly, what you describe is the same
> > problem that people have when writing bytecode interpreters.
>
> Hmm.. Yes, should be quite similar.
>
> > This is a quite-well-studied problem. See
> > http://www.csc.uvic.ca/~csc586a/slides/BranchPredict.pdf
> > for a good discussion and suggestions.
>
> Thanks for the pointer!

Welcome. If you get any speedups as a result of these tricks I'd be
interested to hear about them, because the main valgrind dispatcher
(m_dispatch) suffers from exactly the same problem.

J
|
From: Josef W. <Jos...@gm...> - 2007-03-28 16:55:17
|
On Tuesday 27 March 2007, you wrote:
> On Tue, 27 Mar 2007, Josef Weidendorfer wrote:
> > recently I was playing with the idea to speed up cache simulation
> > by using 2 cores in these (not that new) dual-core processors.
> >
> > The idea is to run the cache simulation in another thread (*1*),
> > feeded by memory access events written into a buffer by
> > cachegrind generated as usual.
>
> There's a paper "SuperPin: Parallelizing Dynamic Instrumentation for
> Real-Time Performance" by Wallace and Hazelwood that discusses this
> topic.

Oh, thanks.

> So what was the overall slow-down of this producer/consumer version vs
> original?

Depends. Here are some preliminary results on a Core2, 2.6 GHz, with
client code "perf/tinycc -c perf/test_input_for_tinycc.c", which gives a
D1 miss rate of 0.2% and an L2 miss rate of 0.0% (do we have a test which
gives a low hit rate?).

32-bit version, VG + client (878 million client instructions):

  Original Cachegrind (3.3.0.SVN):              24.2 s
  Buffer and producer/consumer, with
    event tag in message, 1 switch:             27.8 s (14% slowdown)
    ditto, multiple switches:                   27.1 s (11% slowdown)
    handler function pointers in message
      & one indirect call:                      29.0 s (19% slowdown)

Using multiple calls, i.e. an indirect call at the end of every event
handler, has the problem that you get a huge call chain and occupy much
stack space. I'm just trying to come up with a threaded version with
computed gotos.

However, this version is very dumb, and puts a message into the buffer
for every original log_* function of the current version of Cachegrind,
including all the parameters and also with some header overhead. It
should be possible to make one message per SB, putting e.g. the data_size
back into InstrInfo, thus reducing the message size. BTW, the 64-bit
version is even slower because of the large messages.

> I tried the same thing a few years ago, and found that the number of
> instructions was greatly reduced (by roughly 25% IIRC) but that the
> increase in branch mispredictions neutralized that and speed was
> similar to the original, and the producer/consumer version was more
> complex.

Separating producer and consumer in principle gives the option to split
them into either separate threads or separate processes. The latter would
free the development of online postprocessing from Valgrind constraints
such as C/assembler only and a limited runtime library (no pthreads),
even if you do not think about parallelizing the cache simulation.

Where did you get the instruction reduction in your version? How were the
messages built up?

> > Perhaps this idea is crazy, and in the end not worthwhile.
> > For the visioned parallelization to be useful, the communication
> > overhead to the other core needs to be really low, which
> > means to minimize MESI transactions.
>
> It, like a lot of parallel programming tasks, sounds difficult and
> fragile.

Yes; however, a simple producer/consumer scheme should not be very
complicated. Why do you think this could be fragile? For sure it is not
easy to really get speedups, though. First, there is the problem of the
pure communication overhead. Second, you have the load balancing issue.

Josef
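One common way to keep the cross-core communication overhead low — purely a sketch of the general technique, not code from this thread — is a single-producer/single-consumer ring buffer whose head and tail indices live on separate cache lines, so the producer and consumer cores do not ping-pong a single line between them (i.e. fewer MESI transactions). Assumes a C11 compiler:

```c
#include <stdatomic.h>

/* Hypothetical SPSC ring buffer.  Head is written by the producer only,
   tail by the consumer only; each sits on its own 64-byte cache line. */

#define RING_SIZE 1024                       /* must be a power of two */

typedef struct {
   _Alignas(64) atomic_uint head;            /* producer-owned */
   _Alignas(64) atomic_uint tail;            /* consumer-owned */
   _Alignas(64) unsigned long slot[RING_SIZE];
} Ring;

static Ring g_ring;

static int ring_push(Ring* r, unsigned long v)
{
   unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
   unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
   if (h - t == RING_SIZE) return 0;         /* full */
   r->slot[h & (RING_SIZE - 1)] = v;         /* free-running index, masked */
   atomic_store_explicit(&r->head, h + 1, memory_order_release);
   return 1;
}

static int ring_pop(Ring* r, unsigned long* v)
{
   unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
   unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
   if (t == h) return 0;                     /* empty */
   *v = r->slot[t & (RING_SIZE - 1)];
   atomic_store_explicit(&r->tail, t + 1, memory_order_release);
   return 1;
}
```

Batching the head/tail updates (publishing once per block of events rather than per event) reduces the line transfers further, which matches the "one helper call per full buffer" design above.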
|
From: Josef W. <Jos...@gm...> - 2007-03-28 20:11:40
|
On Wednesday 28 March 2007, Josef Weidendorfer wrote:
> Depends. Here some preliminary results on a Core2, 2.6 GHz.
> With client code "perf/tinycc -c perf/test_input_for_tinycc.c",
> which gives D1 miss rate of 0.2%, and L2 miss rate of 0.0%
> (do we have a test which gives low hit rate?).
>
> 32-bit version, VG + client (878 million client instructions):
>   Original Cachegrind (3.3.0.SVN):            24.2 s
>   Buffer and producer/consumer, with
>     event tag in message, 1 switch:           27.8 s (14% slowdown)
>     ditto, multiple switches:                 27.1 s (11% slowdown)
>     handler function pointers in message
>       & one indirect call:                    29.0 s (19% slowdown)
>
> Using multiple calls, i.e. an indirect call at the end of every event
> handler, has the problem that you get a huge call chain and occupy
> much stack space. I'm just trying to come up with a threaded version
> with computed gotos.

Update: with computed gotos and multiple indirect jumps I get the same
27.1 s figure as with the multiple switch statements. Actually, that
probably was to be expected.

This was a little nasty, however: the multiple "goto *next"s were
optimized into one indirect jump only, so I had to use "asm volatile ..."
to force the multiple jumps.

Josef
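The threaded dispatch Josef describes might look like the following sketch (using GNU C's labels-as-values extension; all names hypothetical). Each handler ends in its own indirect jump, giving the predictor one branch site per handler instead of one shared site — and, as Josef notes, GCC may merge these identical jumps back into a single one unless forced otherwise:

```c
/* Hypothetical threaded dispatcher using computed gotos (GNU C extension). */

enum { EV_INSTR, EV_READ, EV_STOP };
static long t_instr, t_read;          /* stand-ins for the simulation state */

static void dispatch_threaded(const int* ev)
{
   /* Labels-as-values: a table of handler addresses. */
   static void* const target[] = { &&do_instr, &&do_read, &&do_stop };

   goto *target[*ev];

do_instr:
   t_instr++;
   ev++;
   goto *target[*ev];     /* per-handler indirect jump, own predictor slot */

do_read:
   t_read++;
   ev++;
   goto *target[*ev];     /* ditto: a distinct branch site */

do_stop:
   return;
}
```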
|
From: Josef W. <Jos...@gm...> - 2007-03-28 17:21:16
|
On Wednesday 28 March 2007, you wrote:
> Welcome. If you get any speedups as a result of these tricks I'd
> be interested to hear about them, because the main valgrind dispatcher
> (m_dispatch) suffers from exactly the same problem.

I thought that SB chaining reduces the problem here, does it? Or is this
specifically about indirect jumps in the client code?

It really would be nice if there were a way to directly influence the
branch predictor (e.g. write an entry into the BTB). This would solve the
problem at least for me, as I can look up the next target of the indirect
jump well before it is executed.

Recently I read a paper about the IBM Cell; on its SPUs (quite simplistic
SIMD cores), you have to specify the expected branch target yourself to
make the branch "predictor" work. Exactly what would be needed here ;-)

Josef
|
From: Nicholas N. <nj...@cs...> - 2007-03-28 21:54:16
|
On Wed, 28 Mar 2007, Josef Weidendorfer wrote:
> On Wednesday 28 March 2007, you wrote:
>> Welcome. If you get any speedups as a result of these tricks I'd
>> be interested to hear about them, because the main valgrind dispatcher
>> (m_dispatch) suffers from exactly the same problem.
>
> I thought that SB chaining reduces the problem here, does it?

It would, if Valgrind did it :) It used to (in the UCode days), but it
hasn't been implemented for Vex yet. Vex does chase across some
unconditional branches, though.

Nick