Thread: Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)


Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Julian S. <js...@ac...> - 2003-03-18 08:37:24

[2nd try at getting this to v-users list]

On Monday 17 March 2003 7:37 pm, Nicholas Nethercote wrote:
> On Mon, 17 Mar 2003, Jason Evans wrote:
> > > But the numbers are not good, actually a performance decrease against
> > > switch().
> >
> > I've recently been doing some experimentation with computed gotos in an
> > unrelated program, and I've also observed a slowdown in most cases.  This
> > indicates to me that gcc typically does a fine job of optimizing switch
> > statements, and there isn't a whole lot to be gained by second guessing
> > it in such cases.

The computed goto thing is useful for speeding up bytecode interpreters
-- I've used it for that before now -- but this isn't such a case.  It's
the switch in the x86 parser which switches on the opcodes being
examined.  It is used only once per instruction which V translates and
so the cost difference (a few host insns) must be completely swamped by
the rest of the translation costs (000s of host insns per translated
insn, typically).  And translation costs are usually small (0-15%)
compared to the cost of running the translation.  So I'm mystified where
the 7% speedup number comes from.

J
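The argument above is a simple upper bound, and can be sketched in Python. The concrete numbers are assumptions chosen only to illustrate it: "a few host insns" is taken as 5, "000s of host insns" as 1000, and the translation share as the top of the quoted 0-15% range. The function name is hypothetical.

```python
def max_total_speedup(translation_fraction, insns_saved, insns_per_translated_insn):
    """Upper bound on the whole-run speedup from a cheaper opcode dispatch.

    The dispatch runs once per translated instruction, so it can only
    shave its own share off the translation phase, and translation is
    itself only a fraction of the total run time.
    """
    local_saving = insns_saved / insns_per_translated_insn
    return translation_fraction * local_saving


# Assumed values: 15% of time in translation, 5 insns saved out of 1000.
bound = max_total_speedup(0.15, 5, 1000)
print(f"upper bound on total speedup: {bound:.4%}")
```

Under these assumptions the bound is far below one percent, which is why a measured 7% is hard to attribute to the switch itself.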

Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Christian L. <chr...@le...> - 2003-03-18 10:51:15

On Tue, Mar 18, 2003 at 08:45:08AM +0000, Julian Seward wrote:

> The computed goto thing is useful for speeding up bytecode interpreters
> -- I've used it for that before now -- but this isn't such a case.  It's
> the switch in the x86 parser which switches on the opcodes being
> examined.  It is used only once per instruction which V translates and
> so the cost difference (a few host insns) must be completely swamped by
> the rest of the translation costs (000s of host insns per translated
> insn, typically).  And translation costs are usually small (0-15%)
> compared to the cost of running the translation.  

> So I'm mystified where
> the 7% speedup number comes from.

Yes, absolutely, it's very obscure.

Some small changes decreased the performance again; damn, I fooled
myself.

It could be some alignment effect.
So perhaps this stupid 7% also only shows up on Athlon XPs.

But I don't know how to analyse which function takes how long; with that
it should be easy to find the function where this difference is.


Christian Leber

-- 
  "Omnis enim res, quae dando non deficit, dum habetur et non datur,
   nondum habetur, quomodo habenda est."       (Aurelius Augustinus)
  Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>

Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Nicholas N. <nj...@ca...> - 2003-03-18 10:58:24

On Tue, 18 Mar 2003, Christian Leber wrote:

> > So I'm mystified where the 7% speedup number comes from.
>
> Yes, absolutely, it's very obscure.
>
> Some small changes decreased the performance again; damn, I fooled
> myself.
>
> It could be some alignment effect.
> So perhaps this stupid 7% also only shows up on Athlon XPs.
>
> But I don't know how to analyse which function takes how long; with that
> it should be easy to find the function where this difference is.

If you #include "profile.c" into a skin and recompile, the --profile=yes
option turns on tick-based profiling.  That's where Julian got his 0--15%
figure for translation from (note that's for all translation).

N

Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Christian L. <chr...@le...> - 2003-03-20 00:57:28

On Tue, Mar 18, 2003 at 10:58:20AM +0000, Nicholas Nethercote wrote:

> If you #include "profile.c" into a skin and recompile, the --profile=yes
> option turns on tick-based profiling.  That's where Julian got his 0--15%
> figure for translation from (note that's for all translation).

Sorry, this is a little late.
I ran it both ways, but the "resolution" isn't good enough to locate the
difference; the numbers themselves are OK and reproducible.

with the patch:
core:/home/ijuz/Mail/la# time nice -n -10 /work/dev/val/bin/valgrind --profile=yes gzip -9 politech_at_politechbot_com
==25094== Memcheck, a.k.a. Valgrind, a memory error detector for x86-linux.
==25094== Copyright (C) 2002, and GNU GPL'd, by Julian Seward.
==25094== Using valgrind-1.9.4, a program instrumentation system for x86-linux.
==25094== Copyright (C) 2000-2002, and GNU GPL'd, by Julian Seward.
==25094== Estimated CPU clock rate is 1551 MHz
==25094== For more details, rerun with: -v
==25094==
==25094==
==25094== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==25094== malloc/free: in use at exit: 0 bytes in 0 blocks.
==25094== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==25094== For a detailed leak analysis,  rerun with: --leak-check=yes
==25094== For counts of detected errors, rerun with: -v

Profiling done, 1402 ticks
 0:    2 (  1 %%) ticks,           1 entries   for  unclassified
 1: 1225 (873 %%) ticks,        4759 entries   for  running
 2:    6 (  4 %%) ticks,           1 entries   for  scheduler
 3:    3 (  2 %%) ticks,       40568 entries   for  low-lev malloc/free
 4:    0 (  0 %%) ticks,           0 entries   for  client  malloc/free
 5:    0 (  0 %%) ticks,           0 entries   for  adjust-stack
 6:    0 (  0 %%) ticks,        1190 entries   for  translate-main
 7:    0 (  0 %%) ticks,        1190 entries   for  to-ucode
 8:    3 (  2 %%) ticks,        1190 entries   for  from-ucode
 9:    0 (  0 %%) ticks,        1190 entries   for  improve
10:    1 (  0 %%) ticks,        1190 entries   for  reg-alloc
11:    0 (  0 %%) ticks,        1190 entries   for  liveness-analysis
12:    0 (  0 %%) ticks,           0 entries   for  do-lru
13:    0 (  0 %%) ticks,        2382 entries   for  slow-search-transtab
14:    0 (  0 %%) ticks,           1 entries   for  init-memory
15:    0 (  0 %%) ticks,           0 entries   for  exe-context
16:    0 (  0 %%) ticks,           0 entries   for  read-syms
17:    0 (  0 %%) ticks,           0 entries   for  search-syms
18:    0 (  0 %%) ticks,           0 entries   for  add-to-transtab
19:    0 (  0 %%) ticks,         459 entries   for  core-syscall-wrapper
20:    0 (  0 %%) ticks,           0 entries   for  demangle
21:    0 (  0 %%) ticks,        3329 entries   for  core-cheap-sanity
22:    3 (  2 %%) ticks,         134 entries   for  core-expensive-sanity
23:    0 (  0 %%) ticks,           0 entries   for  pre-clo-init
24:    0 (  0 %%) ticks,           0 entries   for  post-clo-init
25:    1 (  0 %%) ticks,        1190 entries   for  instrument
26:    0 (  0 %%) ticks,         478 entries   for  skin-syscall-wrapper
27:    0 (  0 %%) ticks,        3329 entries   for  skin-cheap-sanity
28:   30 ( 21 %%) ticks,         134 entries   for  skin-expensive-sanity
29:    0 (  0 %%) ticks,           0 entries   for  fini
30:    2 (  1 %%) ticks,        1436 entries   for  check-mem-perms
31:  126 ( 89 %%) ticks,    52566183 entries   for  set-mem-perms

real    0m15.607s
user    0m13.960s
sys     0m0.090s
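A note on reading the table: the column labelled "%%" appears to hold parts per thousand of the total tick count rather than percentages, which is why "running" shows 873. The sketch below only checks that reading against a few rows of the output above; it is an observation about the printed numbers, not a statement about how profile.c computes them. The helper name is hypothetical.

```python
def per_mille(ticks, total_ticks):
    # Integer division truncates, which matches the printed values.
    return ticks * 1000 // total_ticks


total_ticks = 1402  # "Profiling done, 1402 ticks"
rows = {
    "running": 1225,
    "skin-expensive-sanity": 30,
    "set-mem-perms": 126,
}

for name, ticks in rows.items():
    print(f"{name:22s} {ticks:5d} ticks -> {per_mille(ticks, total_ticks):4d} per mille")
```

Read this way, "running" accounts for about 87% of the ticks and set-mem-perms for about 9%.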


and without:

core:/home/ijuz/Mail/la# time nice -n -10 /work/dev/val/bin/valgrind --profile=yes gzip -9 politech_at_politechbot_com
==25629== Memcheck, a.k.a. Valgrind, a memory error detector for x86-linux.
==25629== Copyright (C) 2002, and GNU GPL'd, by Julian Seward.
==25629== Using valgrind-1.9.4, a program instrumentation system for x86-linux.
==25629== Copyright (C) 2000-2002, and GNU GPL'd, by Julian Seward.
==25629== Estimated CPU clock rate is 1545 MHz
==25629== For more details, rerun with: -v
==25629==
==25629==
==25629== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==25629== malloc/free: in use at exit: 0 bytes in 0 blocks.
==25629== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==25629== For a detailed leak analysis,  rerun with: --leak-check=yes
==25629== For counts of detected errors, rerun with: -v

Profiling done, 1528 ticks
 0:    1 (  0 %%) ticks,           1 entries   for  unclassified
 1: 1320 (863 %%) ticks,        4757 entries   for  running
 2:    9 (  5 %%) ticks,           1 entries   for  scheduler
 3:    2 (  1 %%) ticks,       40568 entries   for  low-lev malloc/free
 4:    0 (  0 %%) ticks,           0 entries   for  client  malloc/free
 5:    0 (  0 %%) ticks,           0 entries   for  adjust-stack
 6:    0 (  0 %%) ticks,        1190 entries   for  translate-main
 7:    1 (  0 %%) ticks,        1190 entries   for  to-ucode
 8:    1 (  0 %%) ticks,        1190 entries   for  from-ucode
 9:    2 (  1 %%) ticks,        1190 entries   for  improve
10:    2 (  1 %%) ticks,        1190 entries   for  reg-alloc
11:    0 (  0 %%) ticks,        1190 entries   for  liveness-analysis
12:    0 (  0 %%) ticks,           0 entries   for  do-lru
13:    0 (  0 %%) ticks,        2380 entries   for  slow-search-transtab
14:    0 (  0 %%) ticks,           1 entries   for  init-memory
15:    0 (  0 %%) ticks,           0 entries   for  exe-context
16:    0 (  0 %%) ticks,           0 entries   for  read-syms
17:    0 (  0 %%) ticks,           0 entries   for  search-syms
18:    0 (  0 %%) ticks,           0 entries   for  add-to-transtab
19:    0 (  0 %%) ticks,         459 entries   for  core-syscall-wrapper
20:    0 (  0 %%) ticks,           0 entries   for  demangle
21:    0 (  0 %%) ticks,        3329 entries   for  core-cheap-sanity
22:    7 (  4 %%) ticks,         134 entries   for  core-expensive-sanity
23:    0 (  0 %%) ticks,           0 entries   for  pre-clo-init
24:    0 (  0 %%) ticks,           0 entries   for  post-clo-init
25:    3 (  1 %%) ticks,        1190 entries   for  instrument
26:    0 (  0 %%) ticks,         478 entries   for  skin-syscall-wrapper
27:    0 (  0 %%) ticks,        3329 entries   for  skin-cheap-sanity
28:   31 ( 20 %%) ticks,         134 entries   for  skin-expensive-sanity
29:    0 (  0 %%) ticks,           0 entries   for  fini
30:    6 (  3 %%) ticks,        1436 entries   for  check-mem-perms
31:  143 ( 93 %%) ticks,    52566183 entries   for  set-mem-perms

real    0m16.796s
user    0m15.230s
sys     0m0.090s
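The two runs can be compared directly from the figures above. The following Python sketch computes the relative reduction for the totals, and separately sums the categories that look like translation phases (6 to 11 and 25). Treating those categories as the translation cost, and as non-overlapping, is an assumption made for illustration.

```python
def reduction(before, after):
    """Relative reduction going from `before` to `after`."""
    return (before - after) / before


with_patch = {"total_ticks": 1402, "running_ticks": 1225,
              "real_s": 15.607, "user_s": 13.960}
without_patch = {"total_ticks": 1528, "running_ticks": 1320,
                 "real_s": 16.796, "user_s": 15.230}

for key in with_patch:
    print(f"{key:14s} {reduction(without_patch[key], with_patch[key]):.1%}")

# Ticks for translate-main, to-ucode, from-ucode, improve, reg-alloc,
# liveness-analysis and instrument (assumed to be the translation phases).
translation_with = sum([0, 0, 3, 0, 1, 0, 1])
translation_without = sum([0, 1, 1, 2, 2, 0, 3])
print("translation ticks:", translation_with, "vs", translation_without)
```

The overall difference is roughly 7 to 8 percent on every measure, while the assumed translation phases account for only a handful of ticks in either run. Almost all of the gap sits in "running" and set-mem-perms, which is what makes the tick resolution too coarse to blame the opcode switch.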


Christian Leber 


Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Christian L. <chr...@le...> - 2003-03-20 01:40:20

On Tue, Mar 18, 2003 at 10:58:20AM +0000, Nicholas Nethercote wrote:
> If you #include "profile.c" into a skin and recompile, the --profile=yes
> option turns on tick-based profiling.  That's where Julian got his 0--15%
> figure for translation from (note that's for all translation).

OK, sorry, forget all the junk I wrote.

make CC="gcc -falign-functions=16" install

With that build, "running" was 1215 ticks; with an alignment of 8 it is
back at 1315.

(always 16-byte alignment)
speed: function+loops (1349 ticks) < function (1215) < function+jump
(1207) < function+jump+label (1194) < function+jump+label+loops (1173)

Compare these to the values in my other mail.

At least the function alignment seems to be good on an Athlon. I don't
know about other systems, but I think it won't hurt there either.
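To put the alignment figures side by side, the Python sketch below ranks the reported "running" ticks and expresses each against the 1315 ticks seen with 8-byte function alignment. The configuration labels are copied from the message; they presumably correspond to GCC's -falign-functions, -falign-jumps, -falign-labels and -falign-loops options, but that mapping is an assumption.

```python
# "running" ticks per alignment configuration, as reported in the message.
ticks = {
    "function+loops": 1349,
    "function": 1215,
    "function+jump": 1207,
    "function+jump+label": 1194,
    "function+jump+label+loops": 1173,
}
baseline = 1315  # "running" ticks with 8-byte function alignment


def change_vs_baseline(t, base):
    """Signed relative change; negative means fewer ticks than the baseline."""
    return (t - base) / base


ranked = sorted(ticks.items(), key=lambda item: item[1])
for name, t in ranked:
    print(f"{name:27s} {t:5d} {change_vs_baseline(t, baseline):+.1%}")
```

The best configuration is about 11% below the baseline, while function+loops alone comes out slightly above it, so the individual options do not combine additively. These are single measurements on one Athlon and should be read with that in mind.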

Christian Leber
