|
From: Josef W. <Jos...@gm...> - 2011-11-18 16:16:59
Attachments:
s.patch
|
Hi,

I am currently playing with different strategies to make the cache simulator
faster (not the topic of this email). For that, the ugly huge macro currently
used in cachegrind makes it a little difficult.

The attached patch converts the simulation routine from using the macro into
regular functions to be inlined by the compiler. There is absolutely nothing
changed otherwise.

For my system (gcc 4.6.1, amd64), it actually gets a little bit faster
sometimes. I have to say that the results are a bit unstable between runs.
I would be interested whether this is similar on other systems.

Before, with Valgrind 3.7.0:

> perl perf/vg_perf --tools=cachegrind perf/
-- Running tests in perf ----------------------------------------------
bigcode1 valgrind :0.14s ca: 7.1s (50.5x, -----)
bigcode2 valgrind :0.13s ca:10.9s (84.2x, -----)
bz2      valgrind :0.67s ca:19.2s (28.7x, -----)
fbench   valgrind :0.29s ca: 5.5s (19.1x, -----)
ffbench  valgrind :0.26s ca: 6.2s (24.0x, -----)
heap     valgrind :0.09s ca: 5.8s (64.9x, -----)
sarp     valgrind :0.04s ca: 1.5s (37.5x, -----)
tinycc   valgrind :0.24s ca:13.5s (56.2x, -----)
-- Finished tests in perf ----------------------------------------------

With the attached patch applied:

> perl perf/vg_perf --tools=cachegrind perf/
-- Running tests in perf ----------------------------------------------
bigcode1 valgrind :0.15s ca: 6.9s (45.7x, -----)
bigcode2 valgrind :0.15s ca:10.9s (72.5x, -----)
bz2      valgrind :0.66s ca:19.9s (30.1x, -----)
fbench   valgrind :0.28s ca: 5.5s (19.5x, -----)
ffbench  valgrind :0.27s ca: 6.5s (24.2x, -----)
heap     valgrind :0.11s ca: 5.7s (52.1x, -----)
sarp     valgrind :0.04s ca: 1.4s (34.5x, -----)
tinycc   valgrind :0.22s ca:13.4s (60.7x, -----)
-- Finished tests in perf ----------------------------------------------

Josef |
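[Editor's illustration] The kind of conversion the patch performs could be sketched roughly like this (a hypothetical, simplified example, not the actual cachegrind code): a simulation step that used to be expanded from one huge macro becomes a static inline function, which the compiler can still inline but which is far easier to read and modify. The struct fields and function names here are invented for illustration.

```c
#include <assert.h>

/* Simplified model of an LRU set-associative cache.  In the real
   simulator these parameters come from the command line; the layout
   here (tags stored MRU-first per set) is an assumption. */
typedef struct {
   int sets;             /* number of sets, power of two       */
   int assoc;            /* associativity (ways per set)       */
   int line_bits;        /* log2 of the cache line size        */
   unsigned long *tags;  /* sets * assoc entries, MRU at [0]   */
} cache_t;

/* Before: something like
     #define CACHE_ACCESS(C, addr) do { ...dozens of lines... } while (0)
   After: a regular function the compiler inlines on its own.
   Returns 1 on hit, 0 on miss. */
static inline int cache_access(cache_t *c, unsigned long addr)
{
   unsigned long tag = addr >> c->line_bits;   /* block number as tag */
   unsigned long *set = &c->tags[(tag & (c->sets - 1)) * c->assoc];
   int i, j;

   for (i = 0; i < c->assoc; i++) {
      if (set[i] == tag) {
         for (j = i; j > 0; j--)    /* hit: promote to MRU position */
            set[j] = set[j-1];
         set[0] = tag;
         return 1;
      }
   }
   for (j = c->assoc - 1; j > 0; j--)  /* miss: evict LRU, insert MRU */
      set[j] = set[j-1];
   set[0] = tag;
   return 0;
}
```

With the macro, every expansion site carried the full body; as a function, the compiler decides about inlining per call site, and the code can be stepped through in a debugger.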
|
From: Philippe W. <phi...@sk...> - 2011-11-18 22:24:18
|
> For my system (gcc 4.6.1, amd64), it actually gets a little bit faster
> some times.
> I have to say that the results are a bit unstable between runs.
cpu freq scaling?

> I would be interested if this is similar on other systems.
On ppc64/fedora16/gcc 4.6.2, all tests are between 0.4 and 7.4% faster
with the patch.
On x86/fedora12/gcc 4.4.4 (a very old Pentium 4, 3GHz), all tests are
between 1.2 and 4.9% faster, except ffbench (1.7% slower).

So, patch looks good for performance on these systems.
Details below.

Philippe

gcc 4.6.2 ppc64 fedora 16
perl perf/vg_perf --reps=2 --tools=cachegrind perf --vg=../trunk_untouched --vg=../jw_patch
-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 trunk_untouched:0.22s ca: 9.6s (43.4x, -----)
bigcode1 jw_patch       :0.22s ca: 8.8s (40.2x,  7.4%)
-- bigcode2 --
bigcode2 trunk_untouched:0.22s ca: 9.5s (43.4x, -----)
bigcode2 jw_patch       :0.22s ca: 8.9s (40.3x,  7.1%)
-- bz2 --
bz2      trunk_untouched:0.86s ca:30.5s (35.5x, -----)
bz2      jw_patch       :0.86s ca:29.6s (34.5x,  2.8%)
-- fbench --
fbench   trunk_untouched:0.38s ca: 9.4s (24.8x, -----)
fbench   jw_patch       :0.38s ca: 9.4s (24.7x,  0.4%)
-- ffbench --
ffbench  trunk_untouched:0.44s ca: 8.5s (19.4x, -----)
ffbench  jw_patch       :0.44s ca: 8.1s (18.5x,  4.7%)
-- heap --
heap     trunk_untouched:0.40s ca:14.6s (36.5x, -----)
heap     jw_patch       :0.40s ca:14.0s (35.1x,  3.9%)
-- sarp --
sarp     trunk_untouched:0.03s ca: 2.0s (67.0x, -----)
sarp     jw_patch       :0.03s ca: 2.0s (66.0x,  1.5%)
-- tinycc --
tinycc   trunk_untouched:0.28s ca:18.5s (66.2x, -----)
tinycc   jw_patch       :0.28s ca:18.0s (64.2x,  3.1%)
-- Finished tests in perf ----------------------------------------------
== 8 programs, 16 timings =================

x86 fedora 12 gcc 4.4.4
perl perf/vg_perf --reps=2 --tools=cachegrind perf --vg=../trunk_untouched --vg=../jw_patch
-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 trunk_untouched:0.18s ca:24.3s (135.0x, -----)
bigcode1 jw_patch       :0.18s ca:23.2s (129.2x,  4.3%)
-- bigcode2 --
bigcode2 trunk_untouched:0.19s ca:32.5s (171.3x, -----)
bigcode2 jw_patch       :0.19s ca:31.0s (162.9x,  4.9%)
-- bz2 --
bz2      trunk_untouched:1.18s ca:73.6s (62.3x, -----)
bz2      jw_patch       :1.18s ca:71.8s (60.8x,  2.4%)
-- fbench --
fbench   trunk_untouched:0.64s ca:24.0s (37.5x, -----)
fbench   jw_patch       :0.64s ca:23.7s (37.0x,  1.2%)
-- ffbench --
ffbench  trunk_untouched:2.13s ca:24.7s (11.6x, -----)
ffbench  jw_patch       :2.13s ca:25.1s (11.8x, -1.7%)
-- heap --
heap     trunk_untouched:0.20s ca:24.7s (123.5x, -----)
heap     jw_patch       :0.20s ca:24.0s (119.8x,  2.9%)
-- sarp --
sarp     trunk_untouched:0.05s ca: 5.4s (107.2x, -----)
sarp     jw_patch       :0.05s ca: 5.2s (104.0x,  3.0%)
-- tinycc --
tinycc   trunk_untouched:0.39s ca:54.0s (138.5x, -----)
tinycc   jw_patch       :0.39s ca:52.4s (134.4x,  2.9%)
-- Finished tests in perf ----------------------------------------------
== 8 programs, 16 timings ================= |
|
From: Josef W. <Jos...@gm...> - 2011-11-21 12:52:47
|
On 18.11.2011 23:24, Philippe Waroquiers wrote:
>> For my system (gcc 4.6.1, amd64), it actually gets a little bit
>> faster some times.
>> I have to say that the results are a bit unstable between runs.
> cpu freq scaling ?

Probably. Together with my dual-core... my laziness. Thanks for pointing
me to the --reps and --vg options.

>> I would be interested if this is similar on other systems.
>
> On ppc64/fedora16/gcc 4.6.2, all tests are between 0.4 and 7.4% faster
> with the patch.
> On x86/fedora12/gcc 4.4.4, all tests are between 1.2 and 4.9% faster,
> except ffbench (1.7% slower).
> (a very old Pentium 4, 3GHz)
>
> So, patch looks good for performance on these systems.

I did not expect much change at all, so that's good.

It probably only makes sense to apply this patch if I can come up with
some real optimization.

Thanks,
Josef |
|
From: Philippe W. <phi...@sk...> - 2011-11-21 20:52:40
|
>> So, patch looks good for performance on these systems.
>
> I did not expect much change at all, so that's good.

I re-ran on ppc64, with --reps=10. This confirms the patch is positive
(all the tests are faster with --reps=10, including ffbench).

> It probably only makes sense to apply this patch if I can come up with
> some real optimization.

I understand the code with the patch is nicer (no "huge ugly macro"
anymore) and it is faster. So, even if you cannot make it even faster,
this looks like a good thing to apply in any case.

Philippe |
|
From: Philippe W. <phi...@sk...> - 2011-11-22 20:16:01
|
> Together (with the macro removal patch), this gives me

I obtain the below improvements on ppc64.

Philippe

perl perf/vg_perf --reps=10 --tools=cachegrind perf --vg=../trunk_untouched --vg=../jw_patch 2>&1 | tee perf_3patch.out
-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 trunk_untouched:0.22s ca: 9.4s (42.8x, -----)
bigcode1 jw_patch       :0.22s ca: 7.2s (32.8x, 23.4%)
-- bigcode2 --
bigcode2 trunk_untouched:0.22s ca: 9.4s (42.9x, -----)
bigcode2 jw_patch       :0.22s ca: 7.2s (32.7x, 23.8%)
-- bz2 --
bz2      trunk_untouched:0.86s ca:30.4s (35.4x, -----)
bz2      jw_patch       :0.86s ca:26.1s (30.4x, 14.1%)
-- fbench --
fbench   trunk_untouched:0.37s ca: 9.4s (25.5x, -----)
fbench   jw_patch       :0.37s ca: 8.1s (21.9x, 14.2%)
-- ffbench --
ffbench  trunk_untouched:0.43s ca: 8.4s (19.6x, -----)
ffbench  jw_patch       :0.43s ca: 7.3s (17.0x, 13.3%)
-- heap --
heap     trunk_untouched:0.39s ca:14.5s (37.3x, -----)
heap     jw_patch       :0.39s ca:12.6s (32.3x, 13.5%)
-- sarp --
sarp     trunk_untouched:0.03s ca: 2.0s (67.3x, -----)
sarp     jw_patch       :0.03s ca: 1.8s (59.3x, 11.9%)
-- tinycc --
tinycc   trunk_untouched:0.28s ca:18.4s (65.7x, -----)
tinycc   jw_patch       :0.28s ca:16.4s (58.7x, 10.5%)
-- Finished tests in perf ----------------------------------------------
== 8 programs, 16 timings ================= |
|
From: Josef W. <Jos...@gm...> - 2011-11-22 00:41:49
Attachments:
blocks.patch
IrX.patch
|
On 21.11.2011 21:52, Philippe Waroquiers wrote:
>>> So, patch looks good for performance on these systems.
>> I did not expect much change at all, so that's good.
> I re-ran on ppc64, with --reps=10. This confirms the patch is positive
> (all the tests are faster with --reps=10, including ffbench).

Good to know.

>> It probably only makes sense to apply this patch if I can come up with
>> some real optimization.
> I understand the code with the patch is nicer (no "huge ugly macro" anymore)
> and it is faster.
> So, even if you cannot make it even faster, this looks a good thing to
> apply in any case.

Attached are two other patches, on top of the previous one:

(1) cg-tune.patch: add LIKELY hints to the simulator where they make
    sense, and use block numbers as tags (this is always possible).
(2) IrX.patch: regular Ir events will never cross cache lines, which
    allows faster simulation. IrX is used as the (rarely needed)
    generic case.

Together with the macro removal patch, this gives me:

perl valgrind/perf/vg_perf --vg=valgrind --vg=vg-cgopt --reps=2 --tools=cachegrind valgrind/perf
-- Running tests in valgrind/perf -------------------------------------
-- bigcode1 --
bigcode1 valgrind :0.13s ca: 6.9s (53.4x, -----)
bigcode1 vg-cgopt :0.13s ca: 5.7s (43.7x, 18.2%)
-- bigcode2 --
bigcode2 valgrind :0.14s ca:10.8s (76.9x, -----)
bigcode2 vg-cgopt :0.14s ca: 9.7s (69.1x, 10.1%)
-- bz2 --
bz2      valgrind :0.66s ca:19.0s (28.8x, -----)
bz2      vg-cgopt :0.66s ca:14.7s (22.3x, 22.6%)
-- fbench --
fbench   valgrind :0.28s ca: 5.4s (19.4x, -----)
fbench   vg-cgopt :0.28s ca: 4.3s (15.4x, 20.6%)
-- ffbench --
ffbench  valgrind :0.25s ca: 6.2s (24.8x, -----)
ffbench  vg-cgopt :0.25s ca: 5.0s (20.0x, 19.5%)
-- heap --
heap     valgrind :0.10s ca: 5.7s (57.1x, -----)
heap     vg-cgopt :0.10s ca: 4.2s (41.9x, 26.6%)
-- sarp --
sarp     valgrind :0.03s ca: 1.4s (47.3x, -----)
sarp     vg-cgopt :0.03s ca: 1.1s (36.0x, 23.9%)
-- tinycc --
tinycc   valgrind :0.20s ca:13.1s (65.5x, -----)
tinycc   vg-cgopt :0.20s ca:11.4s (56.8x, 13.3%)

The IrX.patch above is an example of making use of information we know at
instrumentation time: the guest instruction address and length. While data
addresses are variable, the access length still is fixed, and it could help
to instrument a quick check for not crossing cache lines...

Another candidate is the access counters being incremented all the time
(and the indirection to get to the counters). It should be possible to keep
one counter per side exit, and add up the final counters at the end.

Josef |
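[Editor's illustration] The static check behind the IrX idea can be sketched like this (an assumed shape, not the patch itself; LINE_SIZE is an example value, the real simulator takes it from the configured cache parameters). Since both the guest instruction's address and its length are compile-time constants at instrumentation time, the instrumenter can emit the cheap single-line helper for the common case and reserve the generic crossing-aware helper for the rare straddling instruction:

```c
#include <assert.h>

/* Assumed example I1 line size; in cachegrind this would come from the
   cache configuration, not a compile-time constant. */
#define LINE_SIZE 64

/* Returns 1 if an instruction fetch at addr with length len stays
   entirely within a single cache line of LINE_SIZE bytes, i.e. the
   first and last byte fall into the same line. */
static inline int fits_in_one_line(unsigned long addr, int len)
{
   return (addr / LINE_SIZE) == ((addr + len - 1) / LINE_SIZE);
}
```

At instrumentation time, `fits_in_one_line(insn_addr, insn_len)` decides once, per static instruction, which helper to call; no check is needed at simulation time.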
|
From: John R. <jr...@bi...> - 2011-11-22 05:50:08
|
> Another candidate seems to be the access counters incrementing all the time
> (and the indirection to get to the counters). It should be possible to do one
> counter per side exit, and add up the final counters at the end.

Unfortunately, in general Kirchhoff's law does _NOT_ apply to software,
because of things like setjmp/longjmp, getcontext/setcontext, exit(),
debuggers, etc. If you want to know how many times something is executed,
then you must count that directly.

-- |
|
From: Josef W. <Jos...@gm...> - 2011-11-22 11:34:47
|
On 22.11.2011 06:50, John Reiser wrote:
>> Another candidate seems to be the access counters incrementing all the time
>> (and the indirection to get to the counters). It should be possible to do one
>> counter per side exit, and add up the final counters at the end.
> Unfortunately in general Kirchhoff's law does _NOT_ apply to software
> because of things like setjmp/longjmp, getcontext/setcontext, exit(),
> debuggers, etc. If you want to know how many times something is executed,
> then you must count that directly.

Hmm. You are right, one has to be careful.

setjmp/longjmp, getcontext/setcontext, and exit() are guest instructions
doing a control flow change, so they should result in regular (side) exits
of a superblock. For these, I do not see a problem. Signals are delivered
between the execution of superblocks, so these should be fine as well.

The problematic case for the above strategy is exceptions/traps. In that
case, we would not count the executions in a block up to the exception.

Hmm... Cachegrind/Callgrind/Lackey are all buggy in that regard, as they
delay the handling of memory accesses by instrumenting callbacks directly
before side exits. This not only saves time by getting rid of
saves/restores of registers around each callback (there is nothing to
save/restore between multiple callbacks in a row), but more importantly
allows merging the events to reduce the number of callbacks. So
incrementing a counter before side exits should give the same result as
currently: we already miss simulator calls on exceptions raised in the
middle of a block :-(

It seems that a tool can catch exceptions. There, it should be able to
emit compensation code. But to issue the missed simulator calls, I am not
sure how to get at the arguments saved in temporary registers in the
instrumented block...

Josef |
|
From: Josef W. <Jos...@gm...> - 2011-11-23 15:46:34
|
On 23.11.2011 08:34, Julian Seward wrote:
> Overall I'd just forget about any loss of precision from exceptions.
> In order for it to be noticeable would require the program to enter
> a signal handler following an exception, literally thousands of times.
> Which doesn't sound very likely.

I agree.

Hmm. I assume we could come up with a simple estimate of how many
simulator calls we could have missed due to exceptions, and if it is
above a given threshold, print out a warning about it.

Josef |
|
From: Josef W. <Jos...@gm...> - 2011-11-23 16:18:45
|
On 23.11.2011 08:39, Julian Seward wrote:
> On Tuesday, November 22, 2011, Philippe Waroquiers wrote:
>>> Together (with the macro removal patch), this gives me
>> I obtain the below improvements on ppc64.
[...]
> Looks good to me.

My goal here actually was to make the common case for instruction fetches
(a hit on the MRU tag in I1) as fast as possible. One remaining obstacle
is incrementing the access counter. If we can avoid that, we could
directly instrument the MRU hit check for Ir.

Is there a possibility to pass more than 3 parameters to a C call?
Perhaps via shadow registers?

Background: I really would like to be able to pass the memory block number
for an Ir access not crossing cache line boundaries directly as a
parameter. We can calculate that at instrumentation time, so there is no
need to derive it in the simulator again and again from the cache
parameters, which actually are constant.

Hmm... Valgrind has this nice code generator, but we "only" use it for
instrumentation. It would be really cool to use VEX to generate the
innermost cache simulation routine for given cache parameters (esp. to
unroll that loop for the fixed associativity), and call that from the C
callback. Do you see a way to accomplish that?

> FWIW, yes, I have also had much fun and games :-)
> trying to get repeatable performance numbers on top end CPUs. I found
> two things; firstly that even small numbers of other tasks running
> (daemons, etc) generate a surprisingly large amount of measurement
> noise, so a dedicated test machine is really worth having [and, also,
> measurement in a VM is hopeless]. Secondly I found that the most
> consistent numbers come from the least microarchitecturally complex
> CPUs. So .. I get the most reliable numbers from my ARM Cortex-A8
> beagleboard.

Interesting!

Josef |
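[Editor's illustration] The MRU fast path discussed above might look roughly like this (an illustrative sketch under assumed cache geometry and names, not cachegrind's actual helper): only the most-recently-used way of the mapped I1 set is compared inline, annotated with a LIKELY hint, and the full LRU search runs only on a mismatch:

```c
#include <assert.h>

/* GCC/Clang branch-prediction hint, as used throughout Valgrind. */
#define LIKELY(x) __builtin_expect(!!(x), 1)

/* Assumed example I1 geometry: 512 sets, 4-way, 64-byte lines. */
#define SETS      512
#define ASSOC     4
#define LINE_BITS 6

static unsigned long i1_tags[SETS * ASSOC];  /* MRU-first per set */
static unsigned long i1_misses;

/* Slow path: full LRU search of the set, promoting hits to MRU and
   counting a miss (with LRU eviction) otherwise. */
static void i1_full_sim(int set, unsigned long tag)
{
   unsigned long *t = &i1_tags[set * ASSOC];
   int i, j;
   for (i = 1; i < ASSOC; i++)   /* way 0 already checked inline */
      if (t[i] == tag)
         goto promote;
   i1_misses++;
   i = ASSOC - 1;                /* evict the LRU way */
promote:
   for (j = i; j > 0; j--)
      t[j] = t[j-1];
   t[0] = tag;
}

/* Fast path for Ir: a single compare against the MRU way. */
static inline void i1_access(unsigned long addr)
{
   unsigned long tag = addr >> LINE_BITS;   /* block number as tag */
   int set = tag & (SETS - 1);
   if (LIKELY(i1_tags[set * ASSOC] == tag))
      return;                               /* MRU hit: common case */
   i1_full_sim(set, tag);                   /* rare slow path */
}
```

The point of the thread is to make exactly this one-compare fast path cheap enough to instrument inline; the remaining cost then is the access counter increment, which the per-side-exit counting idea tries to remove.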
|
From: Josef W. <Jos...@gm...> - 2011-11-24 11:29:07
|
On 24.11.2011 10:12, Julian Seward wrote:
> On Wednesday, November 23, 2011, Josef Weidendorfer wrote:
>> My goal here actually was to make the common case for instruction
>> fetches (hit the MRU tag in I1) as fast as possible. One remaining obstacle
>> is incrementing the access counter. If we can avoid that, we directly
>> could instrument the MRU hit check for Ir.
> Sounds good. /me is not claiming to understand all the details. One thing;
> you know you can do conditional dirty helper calls, yes?

Yes. I am just not sure yet how to mix that with event merging. It could
be that instrumenting the MRU hit check is not worth it.

>> Is there a possibility to pass more than 3 parameters to a C call?
> Mmh, yes. Why do you think it is limited to 3 params?

Ah, good. Probably I had this impression because I never saw a dirty
helper call with more parameters.

> FWIW I think
> all the backends can handle at least 4 word-sized parameters; maybe
> more in some cases (of course, that does not help you since you're
> limited here to what the least capable backend can do.)

>> Hmm... Valgrind has this nice code generator, but we "only" use it for
>> instrumentation. It would be really cool to use VEX to generate the
>> innermost cache simulation routine for given cache parameters (esp.
>> unroll that loop for the fixed associativity), and call that from the
>> C callback. Do you see a way to accomplish that?
> I'm sure it's doable, but it's not a half-a-day kind of hack. It
> would require some messing with infrastructure. I'd need to think
> about it.
>
> Can you get anywhere by using the C preprocessor to generate multiple
> partially specialised copies of the cache simulation and adding calls
> just to the relevant versions (specialised by associativity, whatever,
> etc)?

Could work. Hmm... As I do not want a switch statement for these special
cases in every simulator call, but already want this worked out at
instrumentation time, this results in a lot of different dirty helpers.
I need to play with that.

The benefit of the generated code would be that I can always call the
generated partial simulation, as only one cache parameter set is needed
at a time, without duplicating the helper.

Anyway, it should be easy to just make a special case for my Core i5
laptop, and see if there is any benefit at all.

Thanks!
Josef |
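[Editor's illustration] Julian's preprocessor suggestion could be sketched like this (assumed shapes and names, not real cachegrind code): one macro stamps out a copy of the set lookup per associativity, so the loop bound becomes a compile-time constant the compiler can fully unroll, and the instrumenter picks the matching specialised helper once, at instrumentation time, instead of switching on the associativity inside every simulator call:

```c
#include <assert.h>

/* Generate a partially specialised LRU set lookup for a fixed
   associativity.  Because ASSOC is a literal constant in each expansion,
   the compiler can unroll both loops completely.
   Returns 1 on hit, 0 on miss; set_tags is MRU-first. */
#define GEN_SIM(NAME, ASSOC)                                    \
static int NAME(unsigned long *set_tags, unsigned long tag)     \
{                                                               \
   int i, j;                                                    \
   for (i = 0; i < (ASSOC); i++) {                              \
      if (set_tags[i] == tag) {                                 \
         for (j = i; j > 0; j--)   /* promote hit to MRU */     \
            set_tags[j] = set_tags[j-1];                        \
         set_tags[0] = tag;                                     \
         return 1;                                              \
      }                                                         \
   }                                                            \
   for (j = (ASSOC) - 1; j > 0; j--)  /* miss: evict LRU */     \
      set_tags[j] = set_tags[j-1];                              \
   set_tags[0] = tag;                                           \
   return 0;                                                    \
}

GEN_SIM(sim_assoc4, 4)   /* e.g. for a 4-way L1 */
GEN_SIM(sim_assoc8, 8)   /* e.g. for an 8-way L2 */
```

The instrumenter would then emit a call to `sim_assoc4` or `sim_assoc8` depending on the configured cache, which is exactly what produces "a lot of different dirty helpers" when many configurations must be covered.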
|
From: Julian S. <js...@ac...> - 2011-11-24 11:37:21
|
> Hmm.. as I don't want a switch statement for these special cases in
> every simulator call, but already want this worked out at instrumentation
> time, this results in a lot of different dirty helpers. I need to play
> with that.

Well, I was not thinking of having switches once per sim call, but rather
of jitting calls to specialised versions. But I don't understand this well
enough to comment, so ignore me...

If you're worried about getting too many mispredicts, you might like to
use cachegrind (or callgrind) to profile callgrind; that could be useful.

> Anyway, it should be easy to just make a special case for my Core i5
> laptop, and see if there is any benefit at all.

+1 for that plan.

J |
|
From: Josef W. <Jos...@gm...> - 2011-11-28 23:38:11
|
On 24.11.2011 12:29, Julian Seward wrote:
>> Hmm.. as I don't want a switch statement for these special cases in
>> every simulator call, but already want this worked out at instrumentation
>> time, this results in a lot of different dirty helpers. I need to play
>> with that.
> Well, I was not thinking of having switches once per sim call, rather
> jitting calls to specialised versions.

Ok. Still, lots of helpers.

> If you're worried about getting too many mispredicts, you might like
> to use cachegrind (or callgrind) to profile callgrind; could be useful.

;-)

>> Anyway, it should be easy to just make a special case for my Core i5
>> laptop, and see if there is any benefit at all.
> +1 for that plan.

Unfortunately, special-casing the simulator for specific cache parameters
did not really give me a visible speedup, so I did not investigate
further. See another mail on experiments with direct instrumentation.

Josef |