From: Josef W. <Jos...@gm...> - 2012-10-06 00:19:19
Hi,

I just committed a change to get rid of the huge macro in Cachegrind (no functional/performance changes). We already discussed that quite a while ago, and I thought it was the right time to revisit the topic. The second patch simplifies the simulator by using memory block numbers. It is more or less cosmetic, but it helps a bit. I will change Callgrind in the same way.

But this mail is really about two patches on top of those, speeding up Cachegrind by 30% on average, as attached. Both introduce special cases of the Ir event, and each improves Cachegrind by around 15%.

* Most Ir's do not cross cache lines, so work can be moved to instrumentation time. This changes the semantics of the Ir event to be the special but common case, and adds "IrG" as a fallback - G for "general case" (IrG.patch).

* Of the Ir's touching only one cache line, quite a few go into the same line as the previous Ir and do not change cache state, so the simulation call can be avoided. I call that IrS, S for "sequential access". As it makes sense to combine both Ir and IrS with Dr/Dw etc., this adds quite a few handlers for combined events, but a few handlers could be removed (IrS.patch).

Then I have a third patch reducing the number of parameters for dirty helpers, which seems to be beneficial on x86 (32-bit). On x86_64 it does not matter, but it always reduces the generated code size (pars.patch).

What do you think?

Josef
From: Julian S. <js...@ac...> - 2012-10-08 09:19:38
All looks fine to me. This is an implementation-only change, yes? The insn-read and insn-miss numbers that the user sees will not change?

My only comment re the patches is that the names Ir, IrS, IrG might be a bit confusing, especially the first. Maybe IrS -> IrSeq, IrG -> IrGen, and Ir -> IrNoX (NoX == no crossing of cache line)?

J

On Saturday, October 06, 2012, Josef Weidendorfer wrote:
> [full quote of original message trimmed]
From: Josef W. <Jos...@gm...> - 2012-10-10 18:57:46
Hi Julian,

thanks for the comments!

On 08.10.2012 11:14, Julian Seward wrote:
> All looks fine to me. This is an implementation-only change, yes?

Yes. At least, it is expected to be.

That's actually a tricky problem: when the cachegrind binary changes, a lot of client addresses change as well, and there can be new/other cache conflict misses.

Hmm... there is currently a bug in the second patch. The common case I actually wanted to check for is "Is this Ir completely within the cache line accessed by the previous Ir?". The test "is it sequentially after the previous one?" is not the same, as the new Ir could be aligned at the start of a new cache line, which needs a call into the simulator... but this is easy to fix. A test case with only a few hot regions did not show this issue; with "tinycc", however, it became obvious.

> The insn-read and insn-miss numbers that the user sees will not
> change?
>
> My only comment re the patches is that the names Ir, IrS, IrG
> might be a bit confusing, especially the first. Maybe IrS -> IrSeq,

Using longer names is good. I need to change this one to "IrSame" (same cache line, see above).

> IrG -> IrGen, and Ir -> IrNoX (No X == no crossing of cache line) ?

Sounds good. Actually, I did not do "Ir -> IrNoX" (or similar) to reduce patch size.

BTW, I am not really comfortable with the IrS patch (even a fixed one) yet. An IrS does not need to call into the cache simulator, but there still is a dirty helper call. If I could get rid of it, it should be faster, and there would not be this explosion of combined events to handle. As far as I understand, for any Ir which is not the first in an SB or after a side exit, the number of executions must be the same as for the previous Ir. So a post-processing step should be able to correct the counts.

Josef

> On Saturday, October 06, 2012, Josef Weidendorfer wrote:
> > [full quote of original message trimmed]
From: Julian S. <js...@ac...> - 2012-10-11 13:57:00
> > All looks fine to me. This is an implementation-only change, yes?
>
> Yes.
> At least, it is expected to be.
>
> That's actually a tricky problem. When the cachegrind binary changes,
> also a lot of client addresses will change, and there can be new/other
> cache conflict misses.

I am a bit surprised by that, since the cachegrind binary is loaded at 0x38000000, which is far away from the load address for the client (0x400000) and its libraries. Are you sure you are not seeing the effect of ASLR in the client? Just a guess.

> Sounds good.
> Actually, I did not do "Ir -> IrNoX" (or similar) to reduce patch size.

Storage is cheap these days :-) I prefer readability over minimising patch size :)

> As far as I understand, for any Ir which is not the first in an SB or
> after a side exit, the number of executions must be the same as for the
> previous Ir. So a post-processing step should be able to correct the
> counts.

Yes, I think that is correct. An equivalent way to say it is: if you partition the IR sequence for an SB at the side-exit points, then each sub-sequence you get is a single-entry, single-exit block, and so every IR statement within the block (hence, every guest insn in it) is executed the same number of times. Does that match your understanding?

J