From: Josef W. <Jos...@gm...> - 2011-11-22 00:41:49
On 21.11.2011 21:52, Philippe Waroquiers wrote:
>>> So, patch looks good for performance on these systems.
>> I did not expect much change at all, so that's good.
> I re-ran on ppc64 with --reps=10. This confirms the patch is positive
> (all the tests are faster with --reps=10, including ffbench).

Good to know.

>> It probably only makes sense to apply this patch if I can come up with
>> some real optimization.
> I understand the code with the patch is nicer (no "huge ugly macro"
> anymore) and it is faster.
> So, even if you cannot make it even faster, this looks like a good thing
> to apply in any case.

Attached are two other patches, on top of the previous one:

(1) cg-tune.patch: add LIKELY hints to the simulator where they make
    sense, and use block numbers as tags (this is always possible)

(2) IrX.patch: regular Ir events never cross cache lines, which allows
    faster simulation. Use IrX only for the (rare) generic case.

Together (with the macro-removal patch), this gives me

perl valgrind/perf/vg_perf --vg=valgrind --vg=vg-cgopt --reps=2 --tools=cachegrind valgrind/perf
-- Running tests in valgrind/perf -------------------------------------
-- bigcode1 --
bigcode1 valgrind :0.13s  ca: 6.9s (53.4x, -----)
bigcode1 vg-cgopt :0.13s  ca: 5.7s (43.7x, 18.2%)
-- bigcode2 --
bigcode2 valgrind :0.14s  ca:10.8s (76.9x, -----)
bigcode2 vg-cgopt :0.14s  ca: 9.7s (69.1x, 10.1%)
-- bz2 --
bz2      valgrind :0.66s  ca:19.0s (28.8x, -----)
bz2      vg-cgopt :0.66s  ca:14.7s (22.3x, 22.6%)
-- fbench --
fbench   valgrind :0.28s  ca: 5.4s (19.4x, -----)
fbench   vg-cgopt :0.28s  ca: 4.3s (15.4x, 20.6%)
-- ffbench --
ffbench  valgrind :0.25s  ca: 6.2s (24.8x, -----)
ffbench  vg-cgopt :0.25s  ca: 5.0s (20.0x, 19.5%)
-- heap --
heap     valgrind :0.10s  ca: 5.7s (57.1x, -----)
heap     vg-cgopt :0.10s  ca: 4.2s (41.9x, 26.6%)
-- sarp --
sarp     valgrind :0.03s  ca: 1.4s (47.3x, -----)
sarp     vg-cgopt :0.03s  ca: 1.1s (36.0x, 23.9%)
-- tinycc --
tinycc   valgrind :0.20s  ca:13.1s (65.5x, -----)
tinycc   vg-cgopt :0.20s  ca:11.4s (56.8x, 13.3%)

The IrX.patch above is an example of making use of information we know at
instrumentation time: the guest instruction address and length. While data
addresses are variable, their length is still fixed, so it could help to
instrument a quick check that an access does not cross a cache line...

Another candidate is the access counters, which are incremented all the
time (plus the indirection needed to reach them). It should be possible to
keep one counter per side exit and add up the final counters at the end.

Josef