From: Philippe W. <phi...@sk...> - 2017-11-11 22:27:41
This mail gives some measurements of the perf impact of using link time
optimisations when building valgrind with lto (NB: some hacks
documented below were used to build with -flto).
A summary of the perf impact is:
* callgrind : all perf tests are faster (between 5 and 10%).
* memcheck : many tests are faster, some are equal, one degraded
(I retried this one later, and there was then no degradation).
* helgrind : many tests are faster, a few are slower.
The regression tests seem basically ok (some 30 failures, mostly
due to stacktrace differences, as the tests were also
compiled with -flto).
This experiment was done on Debian 9/amd64, gcc 6.3.0
The build was done using:
export LD=/usr/bin/gold
./autogen.sh
export CFLAGS="-flto -fuse-linker-plugin"
CFLAGS="$CFLAGS" ./configure --enable-only64bit --prefix=`pwd`/Inst
nice make -j4 2>&1 | tee m.out
The make then failed with a bunch of errors. These were (hackily)
bypassed:
* a compilation fails because the generation of libvex_guest_offsets.h
itself fails. IIUC, this file is generated by post-processing a .o
file, but with -flto, the .o file does not contain the relevant
information.
Bypassed by copying the .h file from a normal build.
* the ar and ranlib commands fail, complaining about a missing
plugin.
Bypassed by manually editing coregrind/Makefile and VEX/Makefile,
replacing AR = /usr/bin/ar with AR = gcc-ar and RANLIB = ranlib
with RANLIB = gcc-ranlib
* then the linking of the tools fails due to the unknown symbols
VG_MINIMAL_SETJMP and VG_MINIMAL_LONGJMP.
Bypassed by copying libcoregrind_amd64_linux_a-m_libcsetjmp.o
from a normal build, and then re-running
make libcoregrind-amd64-linux.a
* the linker complained that it could not find _start, and set
a default (non-working) start address.
Bypassed by copying libcoregrind_amd64_linux_a-m_main.o
from a normal build, rebuilding the coregrind library again,
and relaunching make.
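For what it's worth, the ar/ranlib bypass above could also be scripted
rather than edited by hand. A small sketch (Makefile.demo stands in here
for coregrind/Makefile and VEX/Makefile; the variable spellings are the
ones from the generated Makefiles described above):

```shell
# Demonstrate the AR/RANLIB substitution on a stand-in Makefile,
# replacing the plain binutils tools with the lto-plugin-aware wrappers.
printf 'AR = /usr/bin/ar\nRANLIB = ranlib\n' > Makefile.demo
sed -i -e 's|^AR = .*|AR = gcc-ar|' \
       -e 's|^RANLIB = .*|RANLIB = gcc-ranlib|' Makefile.demo
cat Makefile.demo
```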
I guess it should not be too difficult to fix the above properly
in the build system (e.g. by not using -flto for
the 3 files causing problems, and for the tests).
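One possible (untested) shape for that fix: compile the offending
sources with lto disabled, e.g. via an automake convenience library with
its own flags. The target and source names below are illustrative, not
the real Makefile.am entries:

```makefile
# Hypothetical Makefile.am fragment: build the sources that break under
# -flto into a convenience library compiled with -fno-lto, then link it
# into libcoregrind as usual.
noinst_LIBRARIES += libnolto.a
libnolto_a_SOURCES = m_libcsetjmp.c m_main.c
libnolto_a_CFLAGS = $(AM_CFLAGS) -fno-lto
```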
There are some drawbacks to using -flto: link time is significantly
longer, as the code generation happens mostly during the link, and so
is repeated for each tool. The installed coregrind/VEX libs
also use lto, which makes them unusable for users building
their own tools based on the VEX lib.
So, we certainly need an --enable-lto configure option (off by default?),
and maybe, even with lto enabled, it would be better to compile
the libraries and the tools once without lto and once with
lto (and e.g. have a --lto=yes|no option for valgrind to
choose which version of the tool to use).
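As a rough illustration of the configure side (the macros are standard
autoconf, but the wiring and variable names below are assumptions, not a
tested patch):

```m4
dnl Hypothetical configure.ac sketch for an --enable-lto option,
dnl off by default.
AC_ARG_ENABLE([lto],
  [AS_HELP_STRING([--enable-lto],
    [build valgrind with link time optimisation (default no)])],
  [vg_lto=$enableval], [vg_lto=no])
if test "x$vg_lto" = "xyes"; then
  CFLAGS="$CFLAGS -flto -fuse-linker-plugin"
  AR=gcc-ar
  RANLIB=gcc-ranlib
fi
```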
Feedback?
Philippe
perl perf/vg_perf --vg=../trunk_untouched --vg=../smallthing --tools=none,memcheck,helgrind,callgrind --reps=5 perf/ |& tee perf.out
-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1 trunk_untouched:0.07s no: 1.2s (17.4x, -----) me: 2.2s (32.1x, -----) he: 1.7s (23.7x, -----) ca: 9.1s (129.7x, -----)
bigcode1 smallthing:0.07s no: 1.2s (17.1x, 1.6%) me: 2.2s (32.0x, 0.4%) he: 1.6s (23.4x, 1.2%) ca: 8.3s (119.1x, 8.1%)
-- bigcode2 --
bigcode2 trunk_untouched:0.07s no: 2.5s (35.7x, -----) me: 5.1s (72.3x, -----) he: 3.2s (46.0x, -----) ca:18.8s (269.1x, -----)
bigcode2 smallthing:0.07s no: 2.5s (35.1x, 1.6%) me: 5.0s (71.4x, 1.2%) he: 3.2s (45.1x, 1.9%) ca:18.0s (257.7x, 4.2%)
-- bz2 --
bz2 trunk_untouched:0.43s no: 1.5s ( 3.5x, -----) me: 4.5s (10.5x, -----) he: 6.7s (15.6x, -----) ca:10.4s (24.2x, -----)
bz2 smallthing:0.43s no: 1.5s ( 3.5x, -0.7%) me: 4.4s (10.2x, 2.7%) he: 6.5s (15.2x, 2.2%) ca: 9.3s (21.7x, 10.1%)
-- fbench --
fbench trunk_untouched:0.14s no: 0.8s ( 5.9x, -----) me: 2.8s (19.8x, -----) he: 1.9s (13.4x, -----) ca: 4.0s (28.4x, -----)
fbench smallthing:0.14s no: 0.8s ( 5.9x, 0.0%) me: 2.8s (19.8x, 0.0%) he: 1.8s (12.8x, 4.8%) ca: 3.5s (25.2x, 11.3%)
-- ffbench --
ffbench trunk_untouched:0.15s no: 0.9s ( 5.9x, -----) me: 2.6s (17.2x, -----) he: 3.4s (22.8x, -----) ca: 1.5s (10.2x, -----)
ffbench smallthing:0.15s no: 0.9s ( 5.9x, 0.0%) me: 2.6s (17.2x, 0.0%) he: 3.3s (22.3x, 2.3%) ca: 1.4s ( 9.6x, 5.9%)
-- heap --
heap trunk_untouched:0.05s no: 0.6s (11.8x, -----) me: 3.7s (73.2x, -----) he: 5.0s (100.6x, -----) ca: 4.9s (98.0x, -----)
heap smallthing:0.05s no: 0.6s (11.8x, 0.0%) me: 3.5s (69.2x, 5.5%) he: 5.2s (104.0x, -3.4%) ca: 4.3s (86.4x, 11.8%)
-- heap_pdb4 --
heap_pdb4 trunk_untouched:0.06s no: 0.6s (10.5x, -----) me: 5.9s (98.2x, -----) he: 5.7s (94.3x, -----) ca: 5.2s (87.3x, -----)
heap_pdb4 smallthing:0.06s no: 0.6s (10.7x, -1.6%) me: 5.5s (91.5x, 6.8%) he: 5.8s (96.0x, -1.8%) ca: 4.7s (78.8x, 9.7%)
-- many-loss-records --
many-loss-records trunk_untouched:0.01s no: 0.2s (22.0x, -----) me: 1.0s (104.0x, -----) he: 0.8s (83.0x, -----) ca: 0.8s (77.0x, -----)
many-loss-records smallthing:0.01s no: 0.2s (21.0x, 4.5%) me: 0.9s (94.0x, 9.6%) he: 0.9s (89.0x, -7.2%) ca: 0.7s (70.0x, 9.1%)
-- many-xpts --
many-xpts trunk_untouched:0.02s no: 0.3s (13.5x, -----) me: 1.2s (58.0x, -----) he: 1.4s (69.5x, -----) ca: 1.9s (94.0x, -----)
many-xpts smallthing:0.02s no: 0.3s (13.0x, 3.7%) me: 1.1s (53.5x, 7.8%) he: 1.4s (71.0x, -2.2%) ca: 1.6s (82.0x, 12.8%)
-- memrw --
memrw trunk_untouched:0.04s no: 0.4s ( 9.2x, -----) me: 0.9s (21.5x, -----) he: 2.3s (58.2x, -----) ca: 1.9s (47.0x, -----)
memrw smallthing:0.04s no: 0.3s ( 8.8x, 5.4%) me: 0.9s (22.0x, -2.3%) he: 2.2s (55.5x, 4.7%) ca: 1.7s (41.5x, 11.7%)
-- sarp --
sarp trunk_untouched:0.02s no: 0.2s (12.0x, -----) me: 1.5s (77.0x, -----) he: 3.4s (169.0x, -----) ca: 1.3s (63.0x, -----)
sarp smallthing:0.02s no: 0.2s (12.0x, 0.0%) me: 1.5s (77.0x, 0.0%) he: 3.3s (166.0x, 1.8%) ca: 1.1s (56.0x, 11.1%)
-- tinycc --
tinycc trunk_untouched:0.10s no: 0.9s ( 9.2x, -----) me: 6.7s (66.8x, -----) he: 6.7s (66.6x, -----) ca: 7.3s (72.8x, -----)
tinycc smallthing:0.10s no: 0.9s ( 9.2x, 0.0%) me: 6.5s (65.5x, 1.9%) he: 6.5s (65.1x, 2.3%) ca: 6.6s (66.0x, 9.3%)
-- Finished tests in perf ----------------------------------------------
== 12 programs, 96 timings =================
From: John R. <jr...@bi...> - 2017-11-11 23:31:15
On 11/11/2017 1027Z, Philippe Waroquiers wrote:
> This mail gives some measurements of the perf impact of using link time
> optimisations when building valgrind with lto ...

It would be nice if -flto gave a report of what it did (differing from a
plain compile and ordinary link), and if profiling said how much each
change was worth. I expect that there would be some instances where the
change could be expressed in the source code, and perhaps a case where
increased developer understanding would lead to a change in
implementation strategy.

From: Ivo R. <iv...@iv...> - 2017-11-13 10:43:52
2017-11-11 23:27 GMT+01:00 Philippe Waroquiers <phi...@sk...>:
> This mail gives some measurements of the perf impact of using link time
> optimisations when building valgrind with lto (NB: some hacks
> documented below were used to build with -flto).
>
> A summary of the perf impact is:
> * callgrind : all perf tests are faster (between 5 and 10%).
> * memcheck : many tests are faster, some are equal, one degraded
> (I retried this one later, there was then no degradation).
> * helgrind : many tests are faster, a few are slower.
>
> The regression tests seem basically ok (some 30 failures mostly
> due to stacktrace differences, as the tests were also
> compiled with -flto).

Splendid job, Philippe!

Some of the problems could go away if "fat" object files were used
(-ffat-lto-objects).

What I am worried about now is the observability of LTO-built valgrind
binaries. Every section of the gcc manual says that support for
debugging information with LTO is experimental and that it can produce
unexpected results. What are your findings here? Were you able to get
some useful information, for example from the Valgrind C source code,
and from the VEX helper functions called by generated code?

I.

From: Philippe W. <phi...@sk...> - 2017-11-13 22:15:10
On Mon, 2017-11-13 at 11:43 +0100, Ivo Raisr wrote:
> 2017-11-11 23:27 GMT+01:00 Philippe Waroquiers <phi...@sk...>:
> > This mail gives some measurements of the perf impact of using link time
> > optimisations when building valgrind with lto ...
>
> Splendid job, Philippe!
>
> Some of the problems could go away if "fat" object files were used
> (-ffat-lto-objects).

Yes, providing fat objects would make it possible to produce only one
library version, usable for linking with or without lto.

> What I am worried about now is the observability of LTO-built valgrind
> binaries. Every section of the gcc manual says that support for
> debugging information with LTO is experimental and that it can produce
> unexpected results. What are your findings here? Were you able to get
> some useful information, for example from the Valgrind C source code,
> and from the VEX helper functions called by generated code?

I did not have to debug anything, so I cannot really judge, but I just
tried some debugging now by running with --wait-for-gdb=yes, then put a
few breakpoints, looked at some variables and args, and used next and
step. In this small experiment, I had no particular problem debugging.
So, the debugging experience seems not particularly worse than with the
current -O2 setup.
But in any case, I think we should probably support both lto and non-lto
versions, just in case ...

Philippe

From: Robert W. <rjw...@ic...> - 2017-11-14 00:07:49
What kind of speed up do you see for non-LTO test cases run against an
LTO tool? Interested in how much of the speed up is attributable to the
tool.

Sent from my iPhone

> On Nov 13, 2017, at 2:15 PM, Philippe Waroquiers <phi...@sk...> wrote:
> ...

From: Philippe W. <phi...@sk...> - 2017-11-14 17:36:25
On Mon, 2017-11-13 at 15:07 -0800, Robert Walsh wrote:
> What kind of speed up do you see for non-LTO test cases run against an
> LTO tool? Interested in how much of the speed up is attributable to
> the tool.

The perf tests were from a normal build. So the comparison is between
the non-lto trunk and the trunk compiled with lto, measured on the perf
tests from the non-lto trunk.

So, the speed up is fully attributable to the tool.

Philippe

From: Robert W. <rjw...@ic...> - 2017-11-14 17:38:24
👍

Sent from my iPhone

> On Nov 14, 2017, at 9:36 AM, Philippe Waroquiers <phi...@sk...> wrote:
> ...