From: Nicholas N. <nj...@ca...> - 2003-04-07 14:50:53
Hi,

I just committed a change to the CVS head that speeds up the Memcheck and Addrcheck skins a lot: by up to 28% and 36% respectively. Here are some figures for the SPEC2000 benchmark suite, using the "test" inputs:

======== memcheck ========
164.gzip:    vg1:  46.00s, vg2:  37.12s, speed-up: 19.3%
300.twolf:   vg1:   7.34s, vg2:   6.63s, speed-up:  9.7%
197.parser:  vg1:  70.60s, vg2:  57.24s, speed-up: 18.9%
186.crafty:  vg1: 166.83s, vg2: 156.80s, speed-up:  6.0%
255.vortex:  vg1: 392.94s, vg2: 283.37s, speed-up: 27.9%
256.bzip2:   vg1: 152.97s, vg2: 141.28s, speed-up:  7.6%
176.gcc:     vg1:  60.27s, vg2:  52.06s, speed-up: 13.6%
181.mcf:     vg1:   4.24s, vg2:   4.11s, speed-up:  3.1%
254.gap:     vg1:  31.40s, vg2:  24.73s, speed-up: 21.2%
--------
179.art:     vg1: 364.32s, vg2: 373.53s, speed-up: -2.5%
183.equake:  vg1:  68.86s, vg2:  63.54s, speed-up:  7.7%
188.ammp:    vg1: 463.18s, vg2: 455.46s, speed-up:  1.7%
177.mesa:    vg1: 113.69s, vg2: 100.01s, speed-up: 12.0%

======== addrcheck ========
164.gzip:    vg1:  35.97s, vg2:  27.37s, speed-up: 23.9%
300.twolf:   vg1:   5.18s, vg2:   4.48s, speed-up: 13.5%
197.parser:  vg1:  57.92s, vg2:  40.95s, speed-up: 29.3%
186.crafty:  vg1: 119.23s, vg2:  93.99s, speed-up: 21.2%
255.vortex:  vg1: 349.38s, vg2: 223.08s, speed-up: 36.1%
256.bzip2:   vg1: 105.23s, vg2:  90.89s, speed-up: 13.6%
176.gcc:     vg1:  43.52s, vg2:  33.97s, speed-up: 21.9%
181.mcf:     vg1:   2.20s, vg2:   2.07s, speed-up:  5.9%
254.gap:     vg1:  22.90s, vg2:  17.82s, speed-up: 22.2%
--------
179.art:     vg1: 298.79s, vg2: 294.92s, speed-up:  1.3%
183.equake:  vg1:  59.73s, vg2:  57.94s, speed-up:  3.0%
188.ammp:    vg1: 404.55s, vg2: 399.96s, speed-up:  1.1%
177.mesa:    vg1:  92.93s, vg2:  79.65s, speed-up: 14.3%

For those of you interested in the details: the speed-up was possible because the skins' handling of %esp updates was decidedly sub-optimal... every %esp adjustment requires updating the accessibility bits of the relevant stack words.
This was handled with the core events {new,die}_stack_mem{_aligned,}, after the core computed the %esp-delta at run-time. And %esp is updated very frequently. But most of the time the %esp-delta can be computed at compile time, which saves an extra function call.

Now %esp updates are handled with {new,die}_stack_mem, and optionally the specialised versions {new,die}_stack_mem_{4,8,12,16,32}. The core uses these specialised versions if they are present, and falls back to the generic version if they are not, or if the delta cannot be determined at compile time. Addrcheck and Memcheck have unrolled-loop versions for the specialised cases.

Thanks to Julian for pointing out the inefficiency in the old approach.

I haven't tested this super-thoroughly, so I would appreciate it if others could try it out. I would also be interested to hear what kind of speed-ups others get on different machines, environments, etc.

Also, since a few things got moved around in the source, this might cause some CVS clashes for those of you actively hacking Valgrind; apologies if so, but I hope you agree the performance improvements are worth it.

N
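
To illustrate the dispatch described above, here is a minimal sketch in C. It is not the Valgrind code: only the names new_stack_mem, new_stack_mem_4 and new_stack_mem_8 are taken from the message, and their signatures here are assumptions. The byte-per-byte `accessible` array, `struct skin_events`, `pick_new_stack_handler` and `grow_stack` are hypothetical names made up for the example. The idea shown is that a specialised, unrolled handler is chosen when the %esp-delta is known in advance and the skin provides one, and the generic looping handler is used otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shadow state: one accessibility flag per stack byte. */
#define SHADOW_SIZE 256
static uint8_t accessible[SHADOW_SIZE];   /* 1 = addressable, 0 = not */

/* Generic handler: the length is only known at run time, so it loops. */
static void new_stack_mem(size_t addr, size_t len)
{
    for (size_t i = 0; i < len; i++)
        accessible[addr + i] = 1;
}

/* Specialised handlers: the length is fixed, so the loop is unrolled. */
static void new_stack_mem_4(size_t addr)
{
    accessible[addr]     = 1;
    accessible[addr + 1] = 1;
    accessible[addr + 2] = 1;
    accessible[addr + 3] = 1;
}

static void new_stack_mem_8(size_t addr)
{
    new_stack_mem_4(addr);
    new_stack_mem_4(addr + 4);
}

typedef void (*fixed_handler)(size_t addr);

/* Specialised handlers a skin registered; NULL means "not provided". */
struct skin_events {
    fixed_handler new_4;
    fixed_handler new_8;
};

/* Returns the specialised handler for a delta known in advance,
 * or NULL if the caller must fall back to the generic handler. */
static fixed_handler pick_new_stack_handler(const struct skin_events *ev,
                                            int delta_known, size_t delta)
{
    if (!delta_known)
        return NULL;
    switch (delta) {
    case 4:  return ev->new_4;
    case 8:  return ev->new_8;
    default: return NULL;
    }
}

/* Marks newly allocated stack as accessible.
 * Returns 1 if a specialised handler was used, 0 if the generic one was. */
static int grow_stack(const struct skin_events *ev, size_t new_sp,
                      int delta_known, size_t delta)
{
    fixed_handler h = pick_new_stack_handler(ev, delta_known, delta);
    if (h) {
        h(new_sp);
        return 1;
    }
    new_stack_mem(new_sp, delta);
    return 0;
}
```

In this sketch the choice is made on every call for simplicity. As the message describes it, the benefit comes from making that choice once, when the code is translated, so that only the call to the chosen handler remains at run time.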