Thread: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)


[Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Christian L. <chr...@le...> - 2003-03-17 12:12:25

Attachments: valgrind_label.diff

Hello,

first let me say:

valgrind is absolutely great, thank you for it!
It has already saved me often.

I was not able to add MMX support (it does not seem to be exactly easy),
but I saw this really big switch(opt){case 0x00....} statement and thought
"this must be slow".

So I tried the "Labels as Values" feature of gcc.
(http://www.dis.com/gnu/gcc/gcc_79.html#SEC79)
Yes, it is a gcc feature, but because Valgrind is specifically for x86/Linux,
I can't see a problem.
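For readers unfamiliar with the technique, here is a minimal sketch of the two dispatch styles side by side. It is a hypothetical toy interpreter: the opcodes, `run_switch`, and `run_goto` are made up for illustration and are not taken from Valgrind or the attached patch. It needs GCC's labels-as-values extension to compile.

```c
/* Hypothetical toy interpreter, dispatched two ways (GNU C extension). */
#include <stddef.h>

enum { OP_INC, OP_DEC, OP_DOUBLE, OP_HALT };

/* Classic dispatch: every opcode goes back through one switch at the
   top of the loop. */
static int run_switch(const unsigned char *code)
{
    int acc = 0;
    for (size_t pc = 0;; pc++) {
        switch (code[pc]) {
        case OP_INC:    acc++;    break;
        case OP_DEC:    acc--;    break;
        case OP_DOUBLE: acc *= 2; break;
        case OP_HALT:   return acc;
        }
    }
}

/* Labels as Values: a table of label addresses; each handler jumps
   directly to the next handler with goto *. */
static int run_goto(const unsigned char *code)
{
    static void *const table[] = { &&do_inc, &&do_dec, &&do_double, &&do_halt };
    int acc = 0;
    size_t pc = 0;

#define DISPATCH() goto *table[code[pc++]]
    DISPATCH();
do_inc:    acc++;    DISPATCH();
do_dec:    acc--;    DISPATCH();
do_double: acc *= 2; DISPATCH();
do_halt:   return acc;
#undef DISPATCH
}
```

One design difference worth noting: the switch version funnels every opcode through a single shared indirect branch, while the goto version gives each handler its own indirect branch, which can change how well the CPU predicts them.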

"benchmark":
core:/home/ijuz/Mail/la# ls -l
total 4508
-rw-------    1 ijuz     ijuz      4602189 Mar 17 11:56 politech_at_politechbot_com

core:/home/ijuz/Mail/la# nice -n -10 time /work/dev/val/bin/valgrind gzip -9 politech_at_politechbot_com

normal-cvs:
13.57user 0.09system 0:14.07elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
13.51user 0.08system 0:14.01elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k

patched-cvs:
12.54user 0.06system 0:13.07elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
12.56user 0.06system 0:12.97elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k

this means (average):

13.54 -> 12.55
that is a speedup of about 7% in this case
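The averaging above can be checked with a few lines of C. `mean_of_two` and `percent_faster` are illustrative helpers, not part of Valgrind; the inputs are the user times from the gzip runs.

```c
/* Average the two runs of each build, then express the drop in user
   time as a percentage of the unpatched time. */
static double mean_of_two(double a, double b)
{
    return (a + b) / 2.0;
}

static double percent_faster(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

With before = mean_of_two(13.57, 13.51) and after = mean_of_two(12.54, 12.56), percent_faster returns roughly 7.3, matching the "about 7%" above.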

If you (Julian or Nick) are perhaps interested in it, I would clean it up
further and remove this one switch() { } statement.

I have attached the patch; I hope nobody objects to these few kB.
(It is against the current CVS.)

Regards,
Christian Leber

-- 
  "Omnis enim res, quae dando non deficit, dum habetur et non datur,
   nondum habetur, quomodo habenda est."       (Aurelius Augustinus)
  Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>

Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Nicholas N. <nj...@ca...> - 2003-03-17 12:37:56

On Mon, 17 Mar 2003, Christian Leber wrote:

> I saw this really big switch(opt){case 0x00....} statement and thought
> "this must be slow".
>
> So I tried the "Labels as Values" feature of gcc.
> (http://www.dis.com/gnu/gcc/gcc_79.html#SEC79)
> Yes, it is a gcc feature, but because Valgrind is specifically for x86/Linux,
> I can't see a problem.

[snip]

> 13.54 -> 12.55
> that is a speedup of about 7% in this case
>
> If you (Julian or Nick) are perhaps interested in it, I would clean it up
> further and remove this one switch() { } statement.

Interesting.  Two comments:

1. Can you try it with a few other programs?  It would be nice to see if
the 7% speedup is consistent across programs.  Some programs I've used
for performance timing in the past:

   gzip
   bzip2
   latex
   konqueror (startup then quit immediately)

Batch programs are obviously more suitable for this.

2. The bottom of the webpage you mention says:

   "An alternate way to write the above example is

   static const int array[] = { &&foo - &&foo, &&bar - &&foo,
                                &&hack - &&foo };
   goto *(&&foo + array[i]);

   This is more friendly to code living in shared libraries, as it reduces
   the number of dynamic relocations that are needed, and by consequence,
   allows the data to be read-only."

Valgrind is packaged as a shared library.  Perhaps this might improve
things further?
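A minimal sketch of that relative-offset variant, applied to a hypothetical toy interpreter (the opcodes and `run_goto_relative` are invented for illustration). The table stores `int` offsets from one base label instead of absolute addresses, which is what lets it stay read-only in a shared library; the cost is an extra address computation on every dispatch.

```c
/* Relative label table, as in the GCC manual example (GNU C extension). */
#include <stddef.h>

enum { OP_INC, OP_DEC, OP_DOUBLE, OP_HALT };

static int run_goto_relative(const unsigned char *code)
{
    /* Offsets from the base label do_inc; no absolute addresses are
       stored, so no dynamic relocations are needed for this table. */
    static const int offsets[] = {
        &&do_inc - &&do_inc,
        &&do_dec - &&do_inc,
        &&do_double - &&do_inc,
        &&do_halt - &&do_inc,
    };
    int acc = 0;
    size_t pc = 0;

    /* Each dispatch adds the stored offset back onto the base address. */
#define DISPATCH() goto *(&&do_inc + offsets[code[pc++]])
    DISPATCH();
do_inc:    acc++;    DISPATCH();
do_dec:    acc--;    DISPATCH();
do_double: acc *= 2; DISPATCH();
do_halt:   return acc;
#undef DISPATCH
}
```

Because the label differences appear in a static initializer, GCC will not inline or clone this function, so the base label has a single address.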

If you could try these two experiments, and tell us the results, that
would be very helpful.

Thanks.

N

Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Christian L. <chr...@le...> - 2003-03-17 14:06:53

On Mon, Mar 17, 2003 at 12:37:52PM +0000, Nicholas Nethercote wrote:

> 1. Can you try it with a few other programs?  It would be nice to see if
> the 7% speedup is consistent across programs.  Some programs I've used
> for performance timing in the past:
> 
>    gzip
>    bzip2

normal:
48.36user 0.05system 0:49.79elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
48.47user 0.07system 0:51.22elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k

patched:
47.93user 0.09system 0:49.17elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
47.91user 0.09system 0:49.22elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k

>    latex

normal:
real    0m5.088s
user    0m4.790s
sys     0m0.130s

patched:
real    0m4.741s
user    0m4.490s
sys     0m0.140s
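To put the consistency question in numbers, the same arithmetic can be applied to the bzip2 and latex figures above (user time; bzip2 averaged over its two runs, latex from its single run per build). The helper names are illustrative only.

```c
/* Relative reduction in user time from the first (absolute-address)
   patch. A positive value means the patched build was faster. */
static double percent_faster(double before, double after)
{
    return (before - after) / before * 100.0;
}

static double bzip2_gain(void)
{
    double before = (48.36 + 48.47) / 2.0;  /* 48.415 s */
    double after  = (47.93 + 47.91) / 2.0;  /* 47.92 s  */
    return percent_faster(before, after);   /* about 1.0 % */
}

static double latex_gain(void)
{
    return percent_faster(4.79, 4.49);      /* about 6.3 % */
}
```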

>    konqueror (startup then quit immediately)

When I quit it, it eats up all the CPU until I kill it with -9.
(I still have KDE 2.2)

> 2. The bottom of the webpage you mention says:
> 
>    "An alternate way to write the above example is
> 
>    static const int array[] = { &&foo - &&foo, &&bar - &&foo,
>                                 &&hack - &&foo };
>    goto *(&&foo + array[i]);
> 
>    This is more friendly to code living in shared libraries, as it reduces
>    the number of dynamic relocations that are needed, and by consequence,
>    allows the data to be read-only."
> 
> Valgrind is packaged as a shared library.  Perhaps this might improve
> things further?

Yes, but initially I did not get it working; it was a silly little error.
I'll try it, but this takes some time.

With a little test program, cachegrind shows me that the latter approach
is slower. But that program is not a shared library, and I don't know how
often it has to be done.


Regards,
Christian Leber


Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Christian L. <chr...@le...> - 2003-03-17 14:33:19

Attachments: valgrind_label_relativ.diff

On Mon, Mar 17, 2003 at 12:37:52PM +0000, Nicholas Nethercote wrote:

> Valgrind is packaged as a shared library.  Perhaps this might improve
> things further?

Ok, I changed it (patch included).

But the numbers are not good, actually a performance decrease against
switch().

gzip:
13.78user 0.03system 0:14.21elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
13.71user 0.06system 0:14.16elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k

bzip2:
48.49user 0.10system 0:49.79elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
48.66user 0.04system 0:50.06elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k

latex:
real    0m5.056s
user    0m4.880s
sys     0m0.090s
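These figures can be compared with the unpatched baseline runs posted earlier in the thread in the same way. This assumes the earlier baseline runs are comparable (same machine, same input); the array and function names are illustrative.

```c
/* Change in user time of the relative-offset patch versus the unpatched
   baseline (mean of the available runs). Index 0 = gzip, 1 = bzip2,
   2 = latex. A negative value means the patched build was slower. */
static const double baseline_user[] = { (13.57 + 13.51) / 2.0,
                                        (48.36 + 48.47) / 2.0, 4.79 };
static const double relative_user[] = { (13.78 + 13.71) / 2.0,
                                        (48.49 + 48.66) / 2.0, 4.88 };

static double percent_faster(double before, double after)
{
    return (before - after) / before * 100.0;
}

static double relative_change(int i)
{
    return percent_faster(baseline_user[i], relative_user[i]);
}
```

With these inputs the changes come out to roughly -1.5% (gzip), -0.3% (bzip2), and -1.9% (latex).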

Regards,
Christian Leber


Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Jason E. <ja...@ca...> - 2003-03-17 19:20:00

On Mon, Mar 17, 2003 at 03:33:13PM +0100, Christian Leber wrote:
> On Mon, Mar 17, 2003 at 12:37:52PM +0000, Nicholas Nethercote wrote:
> 
> > Valgrind is packaged as a shared library.  Perhaps this might improve
> > things further?
> 
> Ok, I changed it (patch included).
> 
> But the numbers are not good, actually a performance decrease against
> switch().

I've recently been doing some experimentation with computed gotos in an
unrelated program, and I've also observed a slowdown in most cases.  This
indicates to me that gcc typically does a fine job of optimizing switch
statements, and there isn't a whole lot to be gained by second guessing it
in such cases.

Jason

Re: [Valgrind-users] Preliminary little speedup patch (seen 7% speedup)

From: Nicholas N. <nj...@ca...> - 2003-03-17 19:37:29

On Mon, 17 Mar 2003, Jason Evans wrote:

> > But the numbers are not good, actually a performance decrease against
> > switch().
>
> I've recently been doing some experimentation with computed gotos in an
> unrelated program, and I've also observed a slowdown in most cases.  This
> indicates to me that gcc typically does a fine job of optimizing switch
> statements, and there isn't a whole lot to be gained by second guessing it
> in such cases.

I think Christian got good results with the first version, but not with
the second version that computes each label address as a difference from a
base label address.  So it looks like it (the first version) is
worthwhile.

I've had mixed successes with computed gotos in the past as well...
branch prediction is very important;  I once tried the computed goto trick
which saved me quite a few instructions, but made the branches more or
less unpredictable, and the end result was little difference.

N