From: Nicholas N. <nj...@ca...> - 2003-04-07 14:50:53
Hi,
I just committed a change to the CVS head that speeds up the Memcheck and
Addrcheck skins a lot: by up to 28% and 36% respectively. Here are some
figures for the SPEC2000 benchmark suite, using the "test" inputs:
========
memcheck
========
164.gzip: vg1: 46.00s, vg2: 37.12s, speed-up: 19.3%
300.twolf: vg1: 7.34s, vg2: 6.63s, speed-up: 9.7%
197.parser: vg1: 70.60s, vg2: 57.24s, speed-up: 18.9%
186.crafty: vg1: 166.83s, vg2: 156.80s, speed-up: 6.0%
255.vortex: vg1: 392.94s, vg2: 283.37s, speed-up: 27.9%
256.bzip2: vg1: 152.97s, vg2: 141.28s, speed-up: 7.6%
176.gcc: vg1: 60.27s, vg2: 52.06s, speed-up: 13.6%
181.mcf: vg1: 4.24s, vg2: 4.11s, speed-up: 3.1%
254.gap: vg1: 31.40s, vg2: 24.73s, speed-up: 21.2%
--------
179.art: vg1: 364.32s, vg2: 373.53s, speed-up: -2.5%
183.equake: vg1: 68.86s, vg2: 63.54s, speed-up: 7.7%
188.ammp: vg1: 463.18s, vg2: 455.46s, speed-up: 1.7%
177.mesa: vg1: 113.69s, vg2: 100.01s, speed-up: 12.0%
========
addrcheck
========
164.gzip: vg1: 35.97s, vg2: 27.37s, speed-up: 23.9%
300.twolf: vg1: 5.18s, vg2: 4.48s, speed-up: 13.5%
197.parser: vg1: 57.92s, vg2: 40.95s, speed-up: 29.3%
186.crafty: vg1: 119.23s, vg2: 93.99s, speed-up: 21.2%
255.vortex: vg1: 349.38s, vg2: 223.08s, speed-up: 36.1%
256.bzip2: vg1: 105.23s, vg2: 90.89s, speed-up: 13.6%
176.gcc: vg1: 43.52s, vg2: 33.97s, speed-up: 21.9%
181.mcf: vg1: 2.20s, vg2: 2.07s, speed-up: 5.9%
254.gap: vg1: 22.90s, vg2: 17.82s, speed-up: 22.2%
--------
179.art: vg1: 298.79s, vg2: 294.92s, speed-up: 1.3%
183.equake: vg1: 59.73s, vg2: 57.94s, speed-up: 3.0%
188.ammp: vg1: 404.55s, vg2: 399.96s, speed-up: 1.1%
177.mesa: vg1: 92.93s, vg2: 79.65s, speed-up: 14.3%
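
The speed-up column is consistent with the relative reduction in run time, (vg1 - vg2) / vg1, taking vg1 as the old build and vg2 as the new one (an assumption; the labels are not defined above). A minimal sketch of that calculation, with a hypothetical function name:

```c
#include <assert.h>

/* Percentage reduction in run time going from the old build (vg1)
   to the new build (vg2). A negative result means vg2 was slower. */
double speedup_percent(double vg1_secs, double vg2_secs)
{
    assert(vg1_secs > 0.0);   /* avoid dividing by zero */
    return 100.0 * (vg1_secs - vg2_secs) / vg1_secs;
}
```

For example, the Memcheck figures for 164.gzip (46.00s and 37.12s) give roughly 19.3, and 179.art (364.32s and 373.53s) gives a negative value.
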
For those of you interested in the details: the speed-up was possible
because the skins' handling of %esp updates was decidedly sub-optimal:
every %esp adjustment requires updating the accessibility bits of the
relevant stack words. This was handled with the core events
{new,die}_stack_mem{_aligned,}, after the core computed the %esp-delta at
run-time. And %esp is updated very frequently.
But most of the time the %esp-delta can be computed at compile time (that
is, when Valgrind translates the code), which saves an extra function call.
Now %esp updates are handled with
{new,die}_stack_mem, and optionally the specialised versions
{new,die}_stack_mem_{4,8,12,16,32}. The core uses these specialised
versions if they are present, or falls back to the generic version if not,
or if the delta cannot be determined at compile time. Addrcheck and
Memcheck have unrolled-loop versions for the specialised cases.
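
To illustrate the fallback rule described above, here is a rough sketch in C. It is not the code that was committed: the handler table, the toy accessibility map and every name in it are hypothetical, and only one specialised size (8 bytes) is shown. In the real core the choice is made once, when the %esp update is translated; here it is an ordinary function call for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { STACK_WORDS = 64 };                /* toy stack: 64 words of 4 bytes */
uint8_t accessible[STACK_WORDS];          /* one flag per word, 1 = addressable */

/* Generic handler: loops over however many words the delta covers. */
void new_stack_mem(size_t word_idx, size_t len_bytes)
{
    for (size_t i = 0; i < len_bytes / 4; i++)
        accessible[word_idx + i] = 1;
}

/* Specialised handler for an 8-byte delta: loop unrolled, no length argument. */
void new_stack_mem_8(size_t word_idx)
{
    accessible[word_idx]     = 1;
    accessible[word_idx + 1] = 1;
}

/* Hypothetical table of handlers a skin registers with the core. */
typedef struct {
    void (*generic)(size_t word_idx, size_t len_bytes);  /* always present */
    void (*fixed_8)(size_t word_idx);                    /* optional, may be NULL */
} StackHandlers;

/* Use the specialised handler only if the delta is known in advance and the
   skin supplied one; otherwise fall back to the generic handler. */
void on_stack_grow(const StackHandlers *h, int delta_known,
                   size_t word_idx, size_t delta_bytes)
{
    assert(h->generic != NULL);
    if (delta_known && delta_bytes == 8 && h->fixed_8 != NULL)
        h->fixed_8(word_idx);
    else
        h->generic(word_idx, delta_bytes);
}
```

A skin that leaves fixed_8 as NULL still works, since every update then goes through the generic handler; it just misses out on the cheaper call.
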
Thanks to Julian for pointing out the inefficiency in the old approach.
I haven't tested this super-thoroughly, so I would appreciate it if others
could try it out. I would also be interested to hear what kind of
speed-ups others get on different machines, environments, etc.
Also, since a few things got moved around in the source, this might cause
some CVS clashes for those of you actively hacking on Valgrind; apologies if
so, but I hope you agree the performance improvements are worth it.
N