|
From: <sv...@va...> - 2005-12-17 00:22:43
|
Author: njn
Date: 2005-12-17 00:22:39 +0000 (Sat, 17 Dec 2005)
New Revision: 5362
Log:
Add a malloc/free stress test.
Added:
trunk/perf/heap.c
trunk/perf/heap.vgperf
Modified:
trunk/perf/Makefile.am
trunk/perf/README
trunk/perf/vg_perf.in
Modified: trunk/perf/Makefile.am
===================================================================
--- trunk/perf/Makefile.am 2005-12-16 17:06:37 UTC (rev 5361)
+++ trunk/perf/Makefile.am 2005-12-17 00:22:39 UTC (rev 5362)
@@ -6,10 +6,11 @@
bz2.vgperf \
fbench.vgperf \
ffbench.vgperf \
+ heap.vgperf \
sarp.vgperf

check_PROGRAMS = \
-	bigcode bz2 fbench ffbench sarp
+	bigcode bz2 fbench ffbench heap sarp

AM_CFLAGS = $(WERROR) -Winline -Wall -Wshadow -g -O
AM_CPPFLAGS = -I$(top_srcdir) -I$(top_srcdir)/include -I$(top_builddir)/include
Modified: trunk/perf/README
===================================================================
--- trunk/perf/README 2005-12-16 17:06:37 UTC (rev 5361)
+++ trunk/perf/README 2005-12-17 00:22:39 UTC (rev 5362)
@@ -13,6 +13,15 @@
of runtime, particularly on larger programs.
- Weaknesses: Highly artificial.

+heap:
+- Description: Does a lot of heap allocation and deallocation, and has a lot
+  of heap blocks live while doing so.
+- Strengths:   Stress test for an important sub-system; bug #105039 showed
+               that inefficiencies in heap allocation can make a big
+               difference to programs that allocate a lot.
+- Weaknesses:  Highly artificial -- allocation pattern is not real, and only
+               a few different size allocations are used.
+
sarp:
- Description: Does a lot of stack allocation and deallocation.
- Strengths:  Tests for a specific performance bug that existed in 3.1.0 and
Added: trunk/perf/heap.c
===================================================================
--- trunk/perf/heap.c (rev 0)
+++ trunk/perf/heap.c 2005-12-17 00:22:39 UTC (rev 5362)
@@ -0,0 +1,39 @@
+#include <stdio.h>
+#include <stdlib.h>
+
+#define NLIVE 1000000
+
+#define NITERS (3*1000*1000)
+
+char* arr[NLIVE];
+
+int main ( void )
+{
+   int i, j, nbytes = 0;
+   printf("initialising\n");
+   for (i = 0; i < NLIVE; i++)
+      arr[i] = NULL;
+
+   printf("running\n");
+   j = -1;
+   for (i = 0; i < NITERS; i++) {
+      j++;
+      if (j == NLIVE) j = 0;
+      if (arr[j])
+         free(arr[j]);
+      arr[j] = malloc(nbytes);
+
+      // Cycle through the sizes 0,8,16,24,32.  Zero will get rounded up to
+      // 8, so the 8B bucket will get twice as much traffic.
+      nbytes += 8;
+      if (nbytes > 32)
+         nbytes = 0;
+   }
+
+   for (i = 0; i < NLIVE; i++)
+      if (arr[i])
+         free(arr[i]);
+
+   printf("done\n");
+   return 0;
+}
Added: trunk/perf/heap.vgperf
===================================================================
--- trunk/perf/heap.vgperf (rev 0)
+++ trunk/perf/heap.vgperf 2005-12-17 00:22:39 UTC (rev 5362)
@@ -0,0 +1,2 @@
+prog: heap
+tools: none memcheck
Modified: trunk/perf/vg_perf.in
===================================================================
--- trunk/perf/vg_perf.in 2005-12-16 17:06:37 UTC (rev 5361)
+++ trunk/perf/vg_perf.in 2005-12-17 00:22:39 UTC (rev 5362)
@@ -319,10 +319,10 @@
# the speedup.
if (not defined $first_tTool{$tool}) {
$first_tTool{$tool} = $tTool;
- print(" -----) ");
+ print(" -----) ");
} else {
my $speedup = 100 - (100 * $tTool / $first_tTool{$tool});
- printf("%5.1f%%) ", $speedup);
+ printf("%5.1f%%) ", $speedup);
}

$num_timings_done++;
|
|
From: Julian S. <js...@ac...> - 2005-12-17 13:26:04
|
Sheesh. Look at this:
P4 Northwood (suse10, x86):
heap trunk : 0.4s nl: 5.7s (12.9x, -----) mc:85.8s (195.0x, -----)
P3 Tualatin (suse10, x86):
heap trunk : 0.8s nl: 6.6s ( 8.4x, -----) mc:63.5s (81.4x, -----)
7447 (suse10, ppc32):
heap trunk : 1.3s nl: 6.2s ( 4.8x, -----) mc:60.2s (47.0x, -----)
Looks like we hit another P4 microarchitectural lemon of some kind.
(or, glibc's malloc implementation is ultra-tuned for P4 but not for
anything else).  Any self-hosting enthusiasts want to do
cachegrind(memcheck(perf/heap)) to see if there are a lot of cache misses
happening?  (Since this was the cause of the previous performance disaster
on P4.)
J
On Saturday 17 December 2005 00:22, sv...@va... wrote:
> Author: njn
> Date: 2005-12-17 00:22:39 +0000 (Sat, 17 Dec 2005)
> New Revision: 5362
>
> Log:
> Add a malloc/free stress test.
>
> Added:
> trunk/perf/heap.c
> trunk/perf/heap.vgperf
> Modified:
> trunk/perf/Makefile.am
> trunk/perf/README
> trunk/perf/vg_perf.in
>
>
> Modified: trunk/perf/Makefile.am
> ===================================================================
> --- trunk/perf/Makefile.am 2005-12-16 17:06:37 UTC (rev 5361)
> +++ trunk/perf/Makefile.am 2005-12-17 00:22:39 UTC (rev 5362)
> @@ -6,10 +6,11 @@
> bz2.vgperf \
> fbench.vgperf \
> ffbench.vgperf \
> + heap.vgperf \
> sarp.vgperf
>
> check_PROGRAMS = \
> - bigcode bz2 fbench ffbench sarp
> + bigcode bz2 fbench ffbench heap sarp
>
> AM_CFLAGS = $(WERROR) -Winline -Wall -Wshadow -g -O
> AM_CPPFLAGS = -I$(top_srcdir) -I$(top_srcdir)/include
> -I$(top_builddir)/include
>
> Modified: trunk/perf/README
> ===================================================================
> --- trunk/perf/README 2005-12-16 17:06:37 UTC (rev 5361)
> +++ trunk/perf/README 2005-12-17 00:22:39 UTC (rev 5362)
> @@ -13,6 +13,15 @@
> of runtime, particularly on larger programs.
> - Weaknesses: Highly artificial.
>
> +heap:
> +- Description: Does a lot of heap allocation and deallocation, and has a
> lot + of heap blocks live while doing so.
> +- Strengths: Stress test for an important sub-system; bug #105039 showed
> + that inefficiencies in heap allocation can make a big
> + difference to programs that allocate a lot.
> +- Weaknesses: Highly artificial -- allocation pattern is not real, and
> only + a few different size allocations are used.
> +
> sarp:
> - Description: Does a lot of stack allocation and deallocation.
> - Strengths: Tests for a specific performance bug that existed in 3.1.0
> and
>
> Added: trunk/perf/heap.c
> ===================================================================
> --- trunk/perf/heap.c (rev 0)
> +++ trunk/perf/heap.c 2005-12-17 00:22:39 UTC (rev 5362)
> @@ -0,0 +1,39 @@
> +#include <stdio.h>
> +#include <stdlib.h>
> +
> +#define NLIVE 1000000
> +
> +#define NITERS (3*1000*1000)
> +
> +char* arr[NLIVE];
> +
> +int main ( void )
> +{
> + int i, j, nbytes = 0;
> + printf("initialising\n");
> + for (i = 0; i < NLIVE; i++)
> + arr[i] = NULL;
> +
> + printf("running\n");
> + j = -1;
> + for (i = 0; i < NITERS; i++) {
> + j++;
> + if (j == NLIVE) j = 0;
> + if (arr[j])
> + free(arr[j]);
> + arr[j] = malloc(nbytes);
> +
> + // Cycle through the sizes 0,8,16,24,32. Zero will get rounded up
> to + // 8, so the 8B bucket will get twice as much traffic.
> + nbytes += 8;
> + if (nbytes > 32)
> + nbytes = 0;
> + }
> +
> + for (i = 0; i < NLIVE; i++)
> + if (arr[i])
> + free(arr[i]);
> +
> + printf("done\n");
> + return 0;
> +}
>
> Added: trunk/perf/heap.vgperf
> ===================================================================
> --- trunk/perf/heap.vgperf (rev 0)
> +++ trunk/perf/heap.vgperf 2005-12-17 00:22:39 UTC (rev 5362)
> @@ -0,0 +1,2 @@
> +prog: heap
> +tools: none memcheck
>
> Modified: trunk/perf/vg_perf.in
> ===================================================================
> --- trunk/perf/vg_perf.in 2005-12-16 17:06:37 UTC (rev 5361)
> +++ trunk/perf/vg_perf.in 2005-12-17 00:22:39 UTC (rev 5362)
> @@ -319,10 +319,10 @@
> # the speedup.
> if (not defined $first_tTool{$tool}) {
> $first_tTool{$tool} = $tTool;
> - print(" -----) ");
> + print(" -----) ");
> } else {
> my $speedup = 100 - (100 * $tTool / $first_tTool{$tool});
> - printf("%5.1f%%) ", $speedup);
> + printf("%5.1f%%) ", $speedup);
> }
>
> $num_timings_done++;
>
>
>
> _______________________________________________
> Valgrind-developers mailing list
> Val...@li...
> https://lists.sourceforge.net/lists/listinfo/valgrind-developers
|
|
From: Julian S. <js...@ac...> - 2005-12-17 18:05:30
Attachments:
heap_cg_excerpts.txt
|
(running perf/heap.c)

> P4 Northwood (suse10, x86):
>    heap trunk : 0.4s  nl: 5.7s (12.9x, -----)  mc:85.8s (195.0x, -----)

I went looking for cache misses, and found some funny stuff.

First thing I found is that the runtime is not proportional to the number
of iterations: 300k iters take 3 seconds, 600k take 9.

I cachegrinded it.  There are two small fns in m_mallocfree.c which cause
a lot of misses: findSb and swizzle.

findSb wades through the superblock list for an arena to find out which
superblock a free has happened in.  If you run this with -d -d you can see
the superblocks being allocated, and the list quickly becomes fairly long.
Perhaps a scheme of incrementally reorganising the sb list would help.

swizzle is a hack which is used at allocation-time.  The perplexing thing
is that although the cachegrind per-fn summary lists a huge number of read
misses caused by it, the annotated source doesn't show them (afaics, at
least).  Relevant excerpts in attached file (for the 600k iter case).

J
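[Editor's note: the "incrementally reorganising the sb list" idea above could be as simple as a move-to-front heuristic: whenever the search finds the superblock containing a freed block, hoist that superblock to the head of the list, so that temporally clustered frees in the same superblock hit it on the first node. The sketch below is hypothetical (the SB type and names are invented for illustration, not Valgrind's actual m_mallocfree.c code):]

```c
#include <stddef.h>

/* Hypothetical superblock record: covers addresses [start, start+size). */
typedef struct SB {
   char*      start;
   size_t     size;
   struct SB* next;
} SB;

/* Find the superblock containing addr, moving the hit to the list head.
   A run of frees within one superblock then scans one node instead of
   wading through the whole list on every call. */
static SB* find_sb_mtf(SB** head, char* addr)
{
   SB *sb = *head, *prev = NULL;
   while (sb != NULL) {
      if (addr >= sb->start && addr < sb->start + sb->size) {
         if (prev != NULL) {          /* unlink and hoist to front */
            prev->next = sb->next;
            sb->next   = *head;
            *head      = sb;
         }
         return sb;
      }
      prev = sb;
      sb   = sb->next;
   }
   return NULL;   /* addr is not in any superblock */
}
```

[Move-to-front keeps the list ordering adaptive at O(1) extra cost per lookup; whether it beats, say, keeping a sorted array and binary-searching depends on how clustered the free pattern really is.]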
|
From: Nicholas N. <nj...@cs...> - 2005-12-17 19:50:32
|
On Sat, 17 Dec 2005, Julian Seward wrote:

> (running perf/heap.c)
>
>> P4 Northwood (suse10, x86):
>>    heap trunk : 0.4s  nl: 5.7s (12.9x, -----)  mc:85.8s (195.0x, -----)
>
> I went looking for cache misses, and found some funny stuff.
>
> First thing I found is that the runtime is not proportional to the number
> of iterations: 300k iters take 3 seconds, 600k take 9.
>
> I cachegrinded it.  There are two small fns in m_mallocfree.c which cause
> a lot of misses: findSb and swizzle.  findSb wades through the superblock
> list for an arena to find out which superblock a free has happened in.
> If you run this with -d -d you can see the superblocks being allocated,
> and the list quickly becomes fairly long.  Perhaps a scheme of
> incrementally reorganising the sb list would help.
>
> swizzle is a hack which is used at allocation-time.  The perplexing
> thing is that although the cachegrind per-fn summary lists a huge
> number of read misses caused by it, the annotated source doesn't show
> them (afaics, at least).

I did the same yesterday and saw similar things.  The counts in swizzle
just don't add up to the function totals.  I'm not sure why this is;
inlining can sometimes make things confusing, but perhaps it's a
Cachegrind bug.  I'll take a look when I have time.

Nick
|
From: Dirk M. <dm...@gm...> - 2005-12-19 14:31:19
|
On Saturday 17 December 2005 20:50, Nicholas Nethercote wrote:

> I did the same yesterday and saw similar things.  The counts in swizzle
> just don't add up to the function totals.  I'm not sure why this is;
> inlining can sometimes make things confusing, but perhaps it's a
> Cachegrind bug.  I'll take a look when I have time.

It makes sense to compile with -fno-reorder-blocks -fno-inline if you
want to callgrind an application.  That greatly helps in finding the
bottlenecks, if you keep in mind that the picture is a bit skewed by the
missing inlining.

Dirk
|
From: Julian S. <js...@ac...> - 2005-12-17 20:40:25
|
On Saturday 17 December 2005 19:50, Nicholas Nethercote wrote:

> On Sat, 17 Dec 2005, Julian Seward wrote:
>> (running perf/heap.c)
>>
>>> P4 Northwood (suse10, x86):
>>>    heap trunk : 0.4s  nl: 5.7s (12.9x, -----)  mc:85.8s (195.0x, -----)
>
> I did the same yesterday and saw similar things.  The counts in swizzle
> just don't add up to the function totals.

Nevertheless, as usual cachegrind does a great job of pointing out the
smoking gun.  Run time of this program is literally halved following
r5365.

J
|
From: Nicholas N. <nj...@cs...> - 2005-12-19 17:25:23
|
On Sat, 17 Dec 2005, Julian Seward wrote:

>>>> P4 Northwood (suse10, x86):
>>>>    heap trunk : 0.4s  nl: 5.7s (12.9x, -----)  mc:85.8s (195.0x, -----)
>
>> I did the same yesterday and saw similar things.  The counts in swizzle
>> just don't add up to the function totals.
>
> Nevertheless, as usual cachegrind does a great job of pointing out the
> smoking gun.  Run time of this program is literally halved following
> r5365.

I get a 25% speedup.  I tried a couple of real programs (konqueror, vim)
but don't see any effect on them.  Still, it can't have hurt.

Nick
|
From: Julian S. <js...@ac...> - 2005-12-19 17:44:05
|
On Monday 19 December 2005 17:25, Nicholas Nethercote wrote:

> On Sat, 17 Dec 2005, Julian Seward wrote:
>>>>> P4 Northwood (suse10, x86):
>>>>>    heap trunk : 0.4s  nl: 5.7s (12.9x, -----)  mc:85.8s (195.0x, -----)
>>>
>>> I did the same yesterday and saw similar things.  The counts in swizzle
>>> just don't add up to the function totals.
>>
>> Nevertheless, as usual cachegrind does a great job of pointing out the
>> smoking gun.  Run time of this program is literally halved following
>> r5365.
>
> I get a 25% speedup.  I tried a couple of real programs (konqueror, vim)
> but don't see any effect on them.  Still, it can't have hurt.

I'm getting the impression that the cache-related performance problems
we've identified recently exist on all platforms, but are most pronounced
on older P4s, due to the high clock rate, small D1 (8k) and small L2
(256k).

There was also a small change in starting a real program (ktuberling):
95 to 93 seconds.

On a Mac Mini, which has more generous cache arrangements, I saw a change
from a 47x slowdown to 33x, IIRC.  I suspect that's more typical.

J