|
From: Philippe W. <phi...@sk...> - 2012-07-27 22:23:47
Attachments:
patch_regvalues.txt
|
Find attached a patch which drops --vex-iropt-precise-memory-exns and adds
another clo: --vex-iropt-register-updates=minimal|atmemaccess|exact
  minimal     is equivalent to --vex-iropt-precise-memory-exns=no
  atmemaccess is equivalent to --vex-iropt-precise-memory-exns=yes
  exact       is a new value, ensuring register values are exact at each
              instruction (mostly useful for gdbserver).
Regression tested on f12/x86 and deb6/amd64.
Verified that the performance impact is as expected (i.e. minimal is faster
than atmemaccess, which is faster than exact).
Feedback welcome, in particular about verifying that exact effectively
ensures register values are exact at each instruction.
Philippe
|
|
From: Florian K. <br...@ac...> - 2012-07-28 02:26:43
|
On 07/27/2012 06:23 PM, Philippe Waroquiers wrote:
> Find attached a patch which drops --vex-iropt-precise-memory-exns
> and adds another clo
> --vex-iropt-register-updates=minimal|atmemaccess|exact
>
I like it. The old clo had a sufficiently obscure name that it required
consulting the manual to figure out what's happening. And who wants to
do that :)
I would like it even better if we named the choices slightly
differently. How is this:
minimal -> at-endof-sb
atmemaccess -> at-mem-access
exact -> at-insn
I find that clearer.
WRT the implementation...
+ else if VG_XACT_CLO(arg, "--vex-iropt-register-updates=minimal",
+ VG_(clo_vex_control).iropt_register_updates,
+ 0);
Can we please use symbolic names instead of magic constants 0,1,2 ?
enum {
VEX_REG_UPDATE_AT_ENDOF_SB,
VEX_REG_UPDATE_AT_MEM_ACCESS,
VEX_REG_AT_INSN
};
That improves comprehension and nobody gets hurt. We already have enough
magic constants. Let's not introduce new ones.
Florian
|
|
From: Philippe W. <phi...@sk...> - 2012-07-28 10:48:54
Attachments:
patch_core_vex_2.txt
|
On Fri, 2012-07-27 at 22:26 -0400, Florian Krohm wrote:
> minimal -> at-endof-sb
> atmemaccess -> at-mem-access
> exact -> at-insn
A discussion with Julian converged to:
--vex-iropt-register-updates=unwindregs-at-mem-access
                            |allregs-at-mem-access
                            |allregs-at-each-insn   [default: unwindregs-at-mem-access]
> Can we please use symbolic names instead of magic constants 0,1,2 ?
Good idea => enum defined in libvex.h
New patch attached.
Thanks
Philippe
|
|
From: Florian K. <br...@ac...> - 2012-07-28 13:52:37
|
On 07/28/2012 06:49 AM, Philippe Waroquiers wrote:
> + else if VG_XACT_CLO(arg, "--vgdb=full", VG_(clo_vgdb), Vg_VgdbFull) {
> + /* automatically sets register values to exact with --vgdb=full */
Now that "exact" is gone:
automatically updates registers at each insn with --vgdb=full
> + VG_(clo_vex_control).iropt_register_updates = 2;
> + }
2 -> VexAllregsAtEachInsn
Otherwise looks good to me.
Florian
|
|
From: Josef W. <Jos...@gm...> - 2012-07-30 14:16:35
|
Hi,
I like the change, too. It would allow for an even more minimal setting
than the current "unwindregs-at-mem-access" one, which seems not needed
e.g. for cachegrind/callgrind. As I see it, an "only-at-sb-exit" setting
would be enough there, but it obviously needs some modifications in VEX.
Josef

Am 28.07.2012 12:49, schrieb Philippe Waroquiers:
> --vex-iropt-register-updates=unwindregs-at-mem-access
>                             |allregs-at-mem-access
>                             |allregs-at-each-insn [unwindregs-at-mem-access]
|
|
From: Philippe W. <phi...@sk...> - 2012-08-02 20:28:56
Attachments:
patch_allregs_at_sb_exits.txt
|
On Mon, 2012-07-30 at 16:16 +0200, Josef Weidendorfer wrote:
> It would allow for an even more minimal setting than the current
> "unwindregs-at-mem-access" one, which seems not needed e.g. for
> cachegrind/callgrind. As I see, a "only-at-sb-exit" would be
> enough there, but obviously needs some modifications in VEX.
I tested the idea (see attached patch).
I might have made an error in the patch, but I do not see
much difference in performance when specifying
--vex-iropt-register-updates=allregs-at-sb-exits.
Measurements taken on an old Pentium and on an amd64.
(I just repeated the x86 measurement once, as the first time
the allregs-at-sb-exits run was even slower than unwindregs-at-mem-access.)
I used "perl perf/vg_perf --reps=5", and with that saw no
significant difference.
Looks strange?
Maybe the IR code is a very small part of callgrind time?
With the none tool there is, however, a somewhat bigger
positive difference on x86 (did not measure amd64) with time,
but again not with perl perf/vg_perf.
Philippe
for opt in allregs-at-sb-exits unwindregs-at-mem-access allregs-at-mem-access allregs-at-each-insn
do
echo $opt
time ./vg-in-place --stats=yes --tool=callgrind --vex-iropt-register-updates=$opt perf/bz2 2>&1 | grep -e ratio
done
x86
***
allregs-at-sb-exits
--9986-- transtab: new 4,067 (95,961 -> 509,291; ratio 53:10) [0 scs]
real 0m53.247s
user 0m52.984s
sys 0m0.181s
unwindregs-at-mem-access
--9991-- transtab: new 4,067 (95,961 -> 592,178; ratio 61:10) [0 scs]
real 0m55.519s
user 0m55.131s
sys 0m0.219s
allregs-at-mem-access
--9997-- transtab: new 4,067 (95,961 -> 667,929; ratio 69:10) [0 scs]
real 0m56.674s
user 0m56.355s
sys 0m0.199s
allregs-at-each-insn
--10003-- transtab: new 4,073 (95,996 -> 772,182; ratio 80:10) [0 scs]
real 0m58.116s
user 0m57.891s
sys 0m0.206s
amd64
*****
allregs-at-sb-exits
--21396-- transtab: new 3,951 (85,106 -> 758,211; ratio 89:10) [0 scs]
real 0m20.950s
user 0m20.913s
sys 0m0.036s
unwindregs-at-mem-access
--21432-- transtab: new 3,951 (85,106 -> 832,620; ratio 97:10) [0 scs]
real 0m21.195s
user 0m21.145s
sys 0m0.048s
allregs-at-mem-access
--21457-- transtab: new 3,951 (85,106 -> 904,511; ratio 106:10) [0 scs]
real 0m21.927s
user 0m21.909s
sys 0m0.016s
allregs-at-each-insn
--21560-- transtab: new 3,951 (85,106 -> 1,082,296; ratio 127:10) [0 scs]
real 0m22.568s
user 0m22.541s
sys 0m0.016s
none tool on x86
****************
allregs-at-sb-exits
--9968-- transtab: new 3,739 (109,926 -> 457,539; ratio 41:10) [0 scs]
real 0m4.107s
user 0m3.969s
sys 0m0.137s
unwindregs-at-mem-access
--9972-- transtab: new 3,739 (109,926 -> 572,328; ratio 52:10) [0 scs]
real 0m4.758s
user 0m4.630s
sys 0m0.127s
allregs-at-mem-access
--9977-- transtab: new 3,739 (109,926 -> 644,673; ratio 58:10) [0 scs]
real 0m5.795s
user 0m5.673s
sys 0m0.122s
allregs-at-each-insn
--9981-- transtab: new 3,739 (109,926 -> 759,343; ratio 69:10) [0 scs]
real 0m7.595s
user 0m7.462s
sys 0m0.127s
|
|
From: Philippe W. <phi...@sk...> - 2012-08-02 20:57:34
|
On Thu, 2012-08-02 at 22:28 +0200, Philippe Waroquiers wrote:
> With none tool, there is however a little bit more
> positive difference on x86 (did not measure amd64) with time,
> but again not with perl perf/vg_perf ?
Results on amd64 of perl perf with and without EXTRA_REGTEST_OPTS:

WARNING: $EXTRA_REGTEST_OPTS is set. You probably don't want to run
the perf tests with it set, unless you are doing some strange
experiment, and/or you really know what you are doing.
-- Running tests in perf ----------------------------------------------
bigcode1          allregs_at_sb_exits:0.13s  no: 2.0s (15.1x, -----)  ca:16.1s (123.7x, -----)  ca: 6.1s (46.6x, -----)
bigcode2          allregs_at_sb_exits:0.13s  no: 4.5s (35.0x, -----)  ca:28.4s (218.5x, -----)  ca: 9.8s (75.7x, -----)
bz2               allregs_at_sb_exits:0.67s  no: 1.9s ( 2.8x, -----)  ca:20.8s ( 31.1x, -----)  ca:16.2s (24.2x, -----)
fbench            allregs_at_sb_exits:0.28s  no: 1.2s ( 4.4x, -----)  ca: 6.5s ( 23.1x, -----)  ca: 4.8s (17.1x, -----)
ffbench           allregs_at_sb_exits:0.25s  no: 1.0s ( 4.1x, -----)  ca: 2.4s (  9.7x, -----)  ca: 6.0s (24.2x, -----)
heap              allregs_at_sb_exits:0.10s  no: 0.7s ( 6.9x, -----)  ca: 7.4s ( 73.8x, -----)  ca: 4.8s (47.5x, -----)
heap_pdb4         allregs_at_sb_exits:0.14s  no: 0.7s ( 5.1x, -----)  ca: 8.2s ( 58.4x, -----)  ca: 5.3s (38.0x, -----)
many-loss-records allregs_at_sb_exits:0.01s  no: 0.2s (23.0x, -----)  ca: 1.2s (125.0x, -----)  ca: 0.9s (91.0x, -----)
many-xpts         allregs_at_sb_exits:0.04s  no: 0.3s ( 7.5x, -----)  ca: 3.2s ( 81.2x, -----)  ca: 1.4s (34.8x, -----)
sarp              allregs_at_sb_exits:0.03s  no: 0.2s ( 8.0x, -----)  ca: 2.2s ( 72.7x, -----)  ca: 1.3s (43.0x, -----)
tinycc            allregs_at_sb_exits:0.20s  no: 1.4s ( 7.0x, -----)  ca:12.1s ( 60.5x, -----)  ca:10.3s (51.6x, -----)
-- Finished tests in perf ----------------------------------------------
== 11 programs, 33 timings =================

With the normal setup (unwindregs-at-mem-access)
*************************************************
-- Running tests in perf ----------------------------------------------
bigcode1          allregs_at_sb_exits:0.13s  no: 2.0s (15.5x, -----)  ca:16.1s (123.8x, -----)  ca: 6.1s (46.6x, -----)
bigcode2          allregs_at_sb_exits:0.13s  no: 4.5s (35.0x, -----)  ca:28.2s (216.6x, -----)  ca: 9.8s (75.5x, -----)
bz2               allregs_at_sb_exits:0.68s  no: 2.1s ( 3.0x, -----)  ca:20.9s ( 30.8x, -----)  ca:16.2s (23.9x, -----)
fbench            allregs_at_sb_exits:0.27s  no: 1.2s ( 4.6x, -----)  ca: 6.6s ( 24.4x, -----)  ca: 4.7s (17.3x, -----)
ffbench           allregs_at_sb_exits:0.25s  no: 1.1s ( 4.4x, -----)  ca: (interrupted as I was demoralised :).
Philippe
|
|
From: Josef W. <Jos...@gm...> - 2012-08-03 15:00:03
|
Am 02.08.2012 22:28, schrieb Philippe Waroquiers:
> On Mon, 2012-07-30 at 16:16 +0200, Josef Weidendorfer wrote:
>
>> It would allow for an even more minimal setting than the current
>> "unwindregs-at-mem-access" one, which seems not needed e.g. for
>> cachegrind/callgrind. As I see, a "only-at-sb-exit" would be
>> enough there, but obviously needs some modifications in VEX.
> I tested the idea (see attached patch).
>
> I might/must have done an error in the patch, but I do not see
> much difference in performance when I am specifying
> --vex-iropt-register-updates=allregs-at-sb-exits
> Measurements taken on an old Pentium and on an amd64.
Nice.
I just tried it out, using a simple fibonacci calculation code
(compiled with -O3):
#include <stdlib.h>
int fib(int n) { return (n<2) ? n : fib(n-1) + fib(n-2); }
int main(int argc, char* argv[]) { return fib(atoi(argv[1])); }
On my Core-i5:
time ./vg-in-place --tool=none fib 37
real 0m1.003s
user 0m0.956s
sys 0m0.040s
time ./vg-in-place --tool=none
--vex-iropt-register-updates=allregs-at-sb-exits fib 37
real 0m0.862s
user 0m0.816s
sys 0m0.040s
Of course, this is an extreme example of the benefit, but IMHO it's
not bad.
I have a few patches for speeding up cachegrind (which unfortunately
are too late for VG 3.8.0), but the faster the tool, the more important
even such slight improvements become. In any case, it should never get
worse, as it reduces the amount of generated code, as you can see yourself.
./vg-in-place --tool=none --stats=yes fib 37
...
--16367-- transtab: new 1,805 (41,605 -> 317,445; ratio 76:10)
./vg-in-place --tool=none --stats=yes
--vex-iropt-register-updates=allregs-at-sb-exits fib 37
...
--16356-- transtab: new 1,805 (41,610 -> 281,818; ratio 67:10)
Isn't this reduction worth the patch already?
We can make this the default setting for none / cachegrind / callgrind /
lackey.
> (I just repeated the x86 measurement once, as the first time,
> the allregs-at-sb-exits was even slower than unwindregs-at-mem-access).
> I used the "perl perf/vg_perf --reps=5, and for that saw no
> significant difference).
> Looks strange ?
>
> Maybe the IR code is a very small part of callgrind time ?.
Unfortunately, yes:
time ./vg-in-place --stats=yes --tool=callgrind fib 37
--16384-- transtab: new 1,990 (38,666 -> 414,019; ratio 107:10)
real 0m10.480s
user 0m10.401s
sys 0m0.056s
time ./vg-in-place --stats=yes --tool=callgrind
--vex-iropt-register-updates=allregs-at-sb-exits fib 37
--16396-- transtab: new 1,991 (38,671 -> 383,714; ratio 99:10)
real 0m10.260s
user 0m10.105s
sys 0m0.128s
But still 3% (for that fib code).
Josef
> With none tool, there is however a little bit more
> positive difference on x86 (did not measure amd64) with time,
> but again not with perl perf/vg_perf ?
>
> Philippe
>
>
> for opt in allregs-at-sb-exits unwindregs-at-mem-access allregs-at-mem-access allregs-at-each-insn
> do
> echo $opt
> time ./vg-in-place --stats=yes --tool=callgrind --vex-iropt-register-updates=$opt perf/bz2 2>&1 | grep -e ratio
> done
>
> x86
> ***
> allregs-at-sb-exits
> --9986-- transtab: new 4,067 (95,961 -> 509,291; ratio 53:10) [0 scs]
>
> real 0m53.247s
> user 0m52.984s
> sys 0m0.181s
> unwindregs-at-mem-access
> --9991-- transtab: new 4,067 (95,961 -> 592,178; ratio 61:10) [0 scs]
>
> real 0m55.519s
> user 0m55.131s
> sys 0m0.219s
> allregs-at-mem-access
> --9997-- transtab: new 4,067 (95,961 -> 667,929; ratio 69:10) [0 scs]
>
> real 0m56.674s
> user 0m56.355s
> sys 0m0.199s
> allregs-at-each-insn
> --10003-- transtab: new 4,073 (95,996 -> 772,182; ratio 80:10) [0 scs]
>
> real 0m58.116s
> user 0m57.891s
> sys 0m0.206s
>
>
>
> amd64
> *****
> allregs-at-sb-exits
> --21396-- transtab: new 3,951 (85,106 -> 758,211; ratio 89:10) [0 scs]
>
> real 0m20.950s
> user 0m20.913s
> sys 0m0.036s
> unwindregs-at-mem-access
> --21432-- transtab: new 3,951 (85,106 -> 832,620; ratio 97:10) [0 scs]
>
> real 0m21.195s
> user 0m21.145s
> sys 0m0.048s
> allregs-at-mem-access
> --21457-- transtab: new 3,951 (85,106 -> 904,511; ratio 106:10) [0 scs]
>
> real 0m21.927s
> user 0m21.909s
> sys 0m0.016s
> allregs-at-each-insn
> --21560-- transtab: new 3,951 (85,106 -> 1,082,296; ratio 127:10) [0 scs]
>
> real 0m22.568s
> user 0m22.541s
> sys 0m0.016s
>
>
>
> none tool on x86
> ****************
> allregs-at-sb-exits
> --9968-- transtab: new 3,739 (109,926 -> 457,539; ratio 41:10) [0 scs]
>
> real 0m4.107s
> user 0m3.969s
> sys 0m0.137s
> unwindregs-at-mem-access
> --9972-- transtab: new 3,739 (109,926 -> 572,328; ratio 52:10) [0 scs]
>
> real 0m4.758s
> user 0m4.630s
> sys 0m0.127s
> allregs-at-mem-access
> --9977-- transtab: new 3,739 (109,926 -> 644,673; ratio 58:10) [0 scs]
>
> real 0m5.795s
> user 0m5.673s
> sys 0m0.122s
> allregs-at-each-insn
> --9981-- transtab: new 3,739 (109,926 -> 759,343; ratio 69:10) [0 scs]
>
> real 0m7.595s
> user 0m7.462s
> sys 0m0.127s
>
|
|
From: Philippe W. <phi...@sk...> - 2012-08-04 00:52:19
Attachments:
patch_allregs_at_sb_exits.txt
|
On Fri, 2012-08-03 at 16:59 +0200, Josef Weidendorfer wrote:
> Isn't this reduction worth the patch already?
> We can make this the default setting for none / cachegrind / callgrind /
> lackey.
Find attached a new version of the patch which puts allregs-at-sb-exits
as the default for callgrind and cachegrind.
Did not touch lackey.
For none, I am not too sure what is preferred: the none tool might
preferably run with the default value for vex_control?
There is one regression in cachegrind with this patch:
on a debian6/amd64 system, the test cachegrind/tests/x86/fpu-28-108
crashes with:
==20077== Process terminating with default action of signal 11 (SIGSEGV)
==20077== Access not within mapped region at address 0xFEFFCFBC
==20077== at 0x40061EB: _dl_map_object_from_fd (dl-load.c:917)
==20077== by 0x4007BD0: _dl_map_object (dl-load.c:2329)
==20077== by 0x40016CE: map_doit (rtld.c:633)
==20077== by 0x400E5D5: _dl_catch_error (dl-error.c:178)
==20077== by 0x4001586: do_preload (rtld.c:817)
==20077== by 0x4003B0F: dl_main (rtld.c:1683)
==20077== by 0x4014F26: _dl_sysdep_start (dl-sysdep.c:243)
==20077== by 0x4001237: _dl_start (rtld.c:338)
==20077== by 0x4000856: ??? (in /lib32/ld-2.11.3.so)
The same test does not crash with cachegrind with
--vex-iropt-register-updates=unwindregs-at-mem-access.
The same test does not crash with callgrind.
The same test does not crash on f12/x86 (callgrind or cachegrind).
On ppc64, it looks ok.
So, it looks like there are some subtleties on some distros
when not updating at least the unwind regs at each mem access.
Until we understand the subtlety, it does not look like a good idea
to put this in 3.8.0
(or at least not with the default set to allregs-at-sb-exits).
Philippe
|
|
From: Josef W. <Jos...@gm...> - 2012-08-06 12:17:30
|
[Forgot to CC the list]
Am 04.08.2012 02:52, schrieb Philippe Waroquiers:
> Find attached a new version of the patch which puts allregs-at-sb-exits
> as the default for callgrind and cachegrind.
> Did not touch lackey.
Thanks.
> For none, not too sure what is preferred: the none tool might
> preferrably run with the default value for vex_control ?
Updating the guest state multiple times within a SB may be needed
because of requirements from a tool which wants to print error reports
with stack traces. I do not see a reason for "none" needing it.
We use "none" in a lot of regression tests to check for bugs in VG core,
and the guest updating makes stack traces in bug reports better.
But you would try to get a good stack trace in manual bug analysis anyway...
> There is one regression in cachegrind with this patch:
> on a debian6/amd64 system, the test cachegrind/tests/x86/fpu-28-108
> crashes with:
> ==20077== Process terminating with default action of signal 11 (SIGSEGV)
> ==20077== Access not within mapped region at address 0xFEFFCFBC
> ==20077==   at 0x40061EB: _dl_map_object_from_fd (dl-load.c:917)
> ==20077==   by 0x4007BD0: _dl_map_object (dl-load.c:2329)
> ==20077==   by 0x40016CE: map_doit (rtld.c:633)
> ==20077==   by 0x400E5D5: _dl_catch_error (dl-error.c:178)
> ==20077==   by 0x4001586: do_preload (rtld.c:817)
> ==20077==   by 0x4003B0F: dl_main (rtld.c:1683)
> ==20077==   by 0x4014F26: _dl_sysdep_start (dl-sysdep.c:243)
> ==20077==   by 0x4001237: _dl_start (rtld.c:338)
> ==20077==   by 0x4000856: ??? (in /lib32/ld-2.11.3.so)
> The same test does not crash with cachegrind with
> --vex-iropt-register-updates=unwindregs-at-mem-access.
> The same test does not crash with callgrind.
> The same test does not crash on f12/x86 (callgrind or cachegrind).
> On ppc64, it looks ok.
>
> So, it looks like there are some subtilities on some distro
> when not updating at least the unwind regs at each mem access.
>
> Before we understand the subtility, it does not look a good idea
> to put this in 3.8.0.
> (or at least then without setting the default to allregs-at-sb-exits).
Agreed.
Does "none" work with this test case with allregs-at-sb-exits
on debian6/amd64?
Josef
|
|
From: Josef W. <Jos...@gm...> - 2012-08-06 21:56:34
|
Am 06.08.2012 22:36, schrieb Philippe Waroquiers:
> For sure, the patch is not ready yet :).
Hmm.. I think I know the problem.
I was able to reproduce the failure for none/tests/pending (amd64),
and it always fails in the same place, and you can compare with the run
using unwindregs-at-mem-access: both get signal 11 at some point because
of a stack underrun. The version using unwindregs-at-mem-access goes on
and grows the stack, while the version using allregs-at-sb-exits gets
killed by the signal.
I assume that because the guest state was not up-to-date, Valgrind's
SEGFAULT handler was not able to detect that this was a stack underrun.
So ensuring the guest stack register is up-to-date before a memory write
is generally important for Valgrind being able to handle stack underruns.
We still could get rid of register updates for RBP/RIP.
A completely different approach to restore guest register state when
interrupted within execution of a SB would be to store meta information
about the register allocation done at translation time, and reconstruct
the state from that when a signal is raised. But I do not think it's
worth doing that.
Josef
|
|
From: Philippe W. <phi...@sk...> - 2012-08-07 20:46:28
Attachments:
patch_extend_stack_base.txt
patch_allregs_at_sb_exits.txt
|
On Mon, 2012-08-06 at 23:56 +0200, Josef Weidendorfer wrote:
> I assume because the guest state was not up-to-date, Valgrind's
> SEGFAULT handler was not able to detect that this was a stack underrun.
That looks to be the problem: ensuring the stack pointer is up to date,
or changing the detection logic for stack underruns, fixes all tests.
Two patches attached:
1. a patch (horrible hack) that changes the logic to detect stack
   underruns. I do not think this is the way to go; I attach it for
   the record. Instead:
2. a complete patch with the solution of having the stack pointer
   made up to date at memRW (this looks somewhat cleaner/safer).
This second patch has run all regression tests on amd64.
All none tests have also been run with
--vex-iropt-register-updates=allregs-at-sb-exits.
The patch is done for the other archs (arm, s390, ppc32, ppc64)
but not tested there.
Philippe
|
|
From: Philippe W. <phi...@sk...> - 2012-08-07 21:04:57
Attachments:
patch_allregs_at_sb_exits.txt
|
On Tue, 2012-08-07 at 22:46 +0200, Philippe Waroquiers wrote:
> Instead:
> 2. a complete patch with the solution to have the stack pointer
> being made up to date at memRW.
> (this looks somewhat cleaner/safer).
Correct version of the patch attached now.
Philippe
|
|
From: Josef W. <Jos...@gm...> - 2012-08-08 15:22:18
|
Am 07.08.2012 23:04, schrieb Philippe Waroquiers:
> On Tue, 2012-08-07 at 22:46 +0200, Philippe Waroquiers wrote:
>
>> Instead:
>> 2. a complete patch with the solution to have the stack pointer
>> being made up to date at memRW.
>> (this looks somewhat cleaner/safer).
> Correct version of the patch attached now.
> Philippe
Looks good to me, but because of the VEX changes for all archs
it definitely is stuff after 3.8.0.
BTW, I think "allregs-at-sb-exits" is a misnomer, as all other
modes also update "all regs at sb exit". I would go for
"only-at-sb-exits".
Anyway, using the previously mentioned fibonacci code, I get
the following using "perf stat" (Core i-5, Ubuntu 12.04).
VG 3.7.0:
perf stat /usr/bin/valgrind.bin --tool=none fib 42
25.558.006.913 cycles
55.340.821.132 instructions
6.241.658.546 branches
249.880.425 branch-misses
8,886495979 seconds time elapsed
VG 3.8.0SVN:
perf stat ./vg-in-place --tool=none fib 42
25.594.336.065 cycles
46.474.374.107 instructions
4.116.079.890 branches
169.771.115 branch-misses
8,916505520 seconds time elapsed
The difference of -15% in instructions and -35% in branches must
be the effect of chaining introduced with 3.8.0. It is
unclear to me why this does not result in any performance
improvement.
Anyway, with allregs-at-sb-exits (your patch):
perf stat ./vg-in-place --tool=none \
--vex-iropt-register-updates=allregs-at-sb-exits fib 42
21.792.105.898 cycles
43.796.121.724 instructions
4.104.165.014 branches
117.870.745 branch-misses
7,589564685 seconds time elapsed
Josef
|
|
From: Philippe W. <phi...@sk...> - 2012-08-08 22:37:37
|
On Wed, 2012-08-08 at 17:22 +0200, Josef Weidendorfer wrote:
> BTW, I think "allregs-at-sb-exits" is a misnomer, as all other
> modes also update "all regs at sb exit". I would go for
> "only-at-sb-exits".
allregs-only-at-sb-exits?
(as --vex-iropt-register-updates=only-at-sb-exits also seems somewhat
misleading: some registers are updated in the middle of an sb,
even with this mode).
In any case, it looks impossible to summarise precisely in a few words
when/which registers are updated (e.g. helper calls also imply some
registers being updated).
> The difference of -15% in instructions and -35% in branches must
> be the effect of chaining introduced with 3.8.0. It is
> unclear to me why this does not result in any performance
> improvement.
I consider performance measurement of Valgrind more and more as
incomprehensible magic. There are too many factors which are
uncontrolled. (Was cpu frequency scaling off during the measurements,
I wonder?)
Philippe
|
|
From: Josef W. <Jos...@gm...> - 2012-08-09 13:00:43
|
Am 09.08.2012 00:37, schrieb Philippe Waroquiers:
> On Wed, 2012-08-08 at 17:22 +0200, Josef Weidendorfer wrote:
>> BTW, I think "allregs-at-sb-exits" is a misnomer, as all other
>> modes also update "all regs at sb exit". I would go for
>> "only-at-sb-exits".
> allregs-only-at-sb-exits ?
> (as --vex-irop-register-updates=only-at-sb-exits seems also somewhat
> misleading as some registers are updated in the middle of sb,
> even with this mode).
> In any case, it looks not possible to summarise precisely
> in a few words when/which registers are updated
> (e.g. helper calls also imply some registers to be updated).
This is an advanced option anyway, and people interested in changing it
should look up the manual, so as long as the name makes some sense,
I am fine. I think it is more important that tool authors are aware of
the subtleties of VEX options, including this one, and set the default
mode depending on the needs of their tool.
>> The difference of -15% in instructions and -35% in branches must
>> be the effect of chaining introduced with 3.8.0. It is
>> unclear to me why this does not result in any performance
>> improvement.
> I consider more and more performance measurement of Valgrind
> as a not understandable magic.
If the relevant code is not too large (i.e. not too many effects),
and results are stable, it's sometimes useful to dig deeper, and it
can be understandable ;-)
> There are too many factors which are uncontrolled.
> (cpu frequency scaling was off during the measurements I guess) ?
No, but it was running on 1 core (pinned), and because of the > 8s
runtime, for sure it was running full time in turbo mode.
Josef
|