From: Ivo R. <iv...@iv...> - 2014-10-29 05:19:53
I would like to discuss possible changes to the tests in the perf/ directory so that the performance results are more indicative. Currently, on my amd64/Linux machine, the results are as follows:

-- Running tests in perf ----------------------------------------------
bigcode1          valgrind-upstream:0.12s  no: 1.8s (14.7x, -----)  me: 3.5s (28.8x, -----)
bigcode2          valgrind-upstream:0.12s  no: 4.0s (33.0x, -----)  me: 8.4s (70.4x, -----)
bz2               valgrind-upstream:0.56s  no: 1.9s ( 3.3x, -----)  me: 7.3s (13.1x, -----)
fbench            valgrind-upstream:0.25s  no: 1.1s ( 4.6x, -----)  me: 4.3s (17.3x, -----)
ffbench           valgrind-upstream:0.24s  no: 1.1s ( 4.7x, -----)  me: 3.3s (13.8x, -----)
heap              valgrind-upstream:0.10s  no: 0.8s ( 7.6x, -----)  me: 5.7s (56.6x, -----)
heap_pdb4         valgrind-upstream:0.10s  no: 0.8s ( 8.0x, -----)  me: 9.0s (90.1x, -----)
many-loss-records valgrind-upstream:0.01s  no: 0.3s (33.0x, -----)  me: 1.6s (158.0x, -----)
many-xpts         valgrind-upstream:0.05s  no: 0.4s ( 7.8x, -----)  me: 2.1s (42.0x, -----)
sarp              valgrind-upstream:0.01s  no: 0.4s (36.0x, -----)  me: 2.5s (249.0x, -----)
tinycc            valgrind-upstream:0.17s  no: 1.7s (10.1x, -----)  me: 9.4s (55.2x, -----)
-- Finished tests in perf ----------------------------------------------

Please pay special attention to the tests whose native execution completes in less than 100 milliseconds. The native running time is so short that the calculation of the performance speedup is heavily skewed. In particular, there are two tests which run in less than 10 milliseconds; in these cases the speedup calculation is pure guesswork.

I would like to address this shortcoming so that the results are more indicative. Although vg_perf offers a '--reps' option, it just chooses the best running time, so it does not help here.

One possible approach would be to run a test as many times as required to cover, let's say, 1 second during the native run, and then use the same number of runs when running instrumented. However, this approach would mainly measure the program start-up sequence in the case of short tests.

So I was thinking about a different approach, where the running time of the tests is first normalized (by increasing or reducing the work done) so that they finish in roughly the same time for a native run. All tests would also accept a command line option for the number of main iterations to perform. When vg_perf starts, it would measure the time required to perform one iteration of a representative test, and according to this measurement it would automatically adjust the number of iterations passed to each test so that every native test runs for roughly, let's say, 1 second.

I am writing on valgrind-developers and not submitting a bug report because I would like to solicit opinions first. And because bug reports seem to be unhandled and forgotten.

Let me know,
I.
From: Philippe W. <phi...@sk...> - 2014-11-01 16:25:13
|
On Wed, 2014-10-29 at 06:19 +0100, Ivo Raisr wrote:
> I would like to address this shortcoming so the results are more
> indicative. Although vg_perf offers '--reps' option, it just chooses
> the best running time; so it does not help here.
>
> One possible approach would be to run a test as many times as required
> to cover let's say 1 second during native run. And so many runs will
> be used for running instrumented. However this approach will measure
> mainly the start program sequence in case of short tests.

For sure, the perf tests are far from perfect. Among others, as you say, the perf ratio against the native run is heavily skewed by small variations. We also see significant variations within a single test (e.g. memcheck) between multiple runs. So it would be nice to have something better.

> So I was thinking about different approach where the running time of
> tests will be first normalized (increasing or reducing work) so they
> finish roughly in the same time for native run. All tests will also
> accept command line option for the number of main iterations to be
> performed. When vg_perf starts, it will measure time required to
> perform 1 iteration of a representative test. And according to this
> measurement, it will adjust automatically the number of iterations
> passed to each test so that every native test runs roughly let's say
> for 1 second.

This approach has the disadvantage that we can no longer compare different platforms with each other. Also, to keep the performance ratio correct, the number of iterations will have to be propagated from the native run to the runs under the various tools. The perf tests will then run even more slowly (e.g. on ppc64, the perf tests already take about one hour). So I am not too sure what to do.

Typically, "rich" projects similar to valgrind have access to test suites such as SPEC and use those to benchmark against. But these tests are, to my knowledge, not free (both as in free beer and as in free speech). Does someone know about good free perf tests?

> I am writing on valgrind-developers and not submitting a bug report
> because I would like to solicit opinions first. And because bug
> reports seem to be unhandled and forgotten.

True, but posts on valgrind-developers also tend to be forgotten :).

Philippe
From: Ivo R. <iv...@iv...> - 2014-11-14 22:59:40
Hi Philippe,

Thank you for your response! My apologies for not replying earlier, I was travelling.

2014-11-01 17:25 GMT+01:00 Philippe Waroquiers <phi...@sk...>:
> For sure, the perf tests are far from perfect. Among others, as you
> say, the perf ratio against the native run is heavily skewed by small
> variations. We also see significant variations within a single test
> (e.g. memcheck) between multiple runs. So it would be nice to have
> something better.

Since only the 'user' time is measured, I believe these variations come mainly from the start sequence of the measured program. Also, 'time -p' is used for compatibility with AIX (which is long dead), so precision is lost there as well.

> This approach has the disadvantage that we can no longer compare
> different platforms with each other.

I can address that, see below.

> Also, to keep the performance ratio correct, the number of iterations
> will have to be propagated from the native run to the runs under the
> various tools.

Yes, that is what I intend to do. See below.

> The perf tests will then run even more slowly (e.g. on ppc64, the perf
> tests already take about one hour).

Not necessarily. There are several perf tests whose native run is close to 5 seconds. Multiply this by the number of iterations and the number of tools and you get several minutes.

> Typically, "rich" projects similar to valgrind have access to test
> suites such as SPEC and use those to benchmark against. But these
> tests are, to my knowledge, not free (both as in free beer and as in
> free speech).

Yes, the SPEC benchmarks are not free. Nevertheless, I think the perf tests we currently have in the perf/ directory are actually quite good (see perf/README) and some of them resemble SPEC ones. In addition, running a SPEC benchmark would take much more time than the current perf/ tests.

> Does someone know about good free perf tests?

There is an initiative called OpenBench [1], but it seems rather untouched recently.

> > And because bug reports seem to be unhandled and forgotten.
> True, but posts on valgrind-developers also tend to be forgotten :).

Currently I have the following orphan bug reports:
https://bugs.kde.org/show_bug.cgi?id=339636 (rex64/fxsave)
https://bugs.kde.org/show_bug.cgi?id=340320 (m_replacemalloc command line options)
Not so bad considering this represents only 7% of my total report count ;-)

[1] http://www.exactcode.de/site/open_source/openbench/

===========================
So here is my proposal.

PHASE 1:
----------------------------
More accurate timing and better formatting of perf tests.
Example of the proposed output (invoked with --tools=none,memcheck,callgrind,helgrind,cachegrind,drd,massif --reps=3 --vg=../valgrind-new --vg=../valgrind-old):

-- Running tests in perf ----------------------------------------------
-- bigcode1 valgrind-new: native 0.223s
   none:        1.605s ( 7.2x, -----)
   memcheck:    3.395s (15.1x, -----)
   callgrind:  18.139s (82.3x, -----)
   helgrind:    2.001s ( 9.3x, -----)
   cachegrind:  5.404s (24.5x, -----)
   drd:         2.004s ( 8.9x, -----)
   massif:      2.306s (10.3x, -----)
-- bigcode1 valgrind-old: native 0.223s
   none:        1.976s ( 8.4x, -17.1%)
   memcheck:    3.203s (14.4x,   5.1%)
   callgrind:  18.750s (85.2x,  -3.5%)
   helgrind:    2.096s ( 9.2x,   0.5%)
   cachegrind:  5.383s (24.3x,   0.7%)
   drd:         2.095s ( 9.0x,  -0.5%)
   massif:      2.101s ( 9.8x,   5.3%)
...
-- Finished tests in perf ----------------------------------------------
== 11 programs, 154 timings =================

So basically every tool gets its own line (more information will appear there in subsequent phases). Timing is printed with millisecond precision. This is achieved by measuring the user time spent directly in the benchmark tests, completely omitting the start sequence of these programs. The idea is to call getrusage() at the beginning and at the end of main() and print the delta of the user times returned by these calls.

PHASE 2:
---------------------------
In phase 2, I would like to address the variation in running time between the individual benchmark tests, where possible. Ideally, all native runs would finish in around, let's say, 1 second; it would no longer be possible for some runs to finish in less than 10 ms and skew all the computations. A benchmark test in a native run would measure for itself the number of iterations performed roughly within the desired interval (1 second, for example). This number of iterations would then be passed to the subsequent runs under the tools.
The number of iterations will also appear in the perf/ output, so it will still be possible to compare different platforms, for example:

platform A: 1405 iterations within 1.041s
platform B:  957 iterations within 1.003s

and these numbers can be trivially converted into cross-platform "units" (also printed in the output).

Sounds good? Let me know,
I.