|
From: <sv...@va...> - 2005-12-10 23:11:36
|
Author: njn
Date: 2005-12-10 23:11:28 +0000 (Sat, 10 Dec 2005)
New Revision: 5323
Log:
First attempt at some performance tracking tools. Includes a script vg_perf
(use "make perf" to run) that executes test programs and times their
slowdowns under various tools. It works a lot like the vg_regtest script.
It's a bit rough around the edges -- eg. you can't currently directly
compare two different versions of Valgrind, which would be useful -- but it
is a good start.
There are currently two test programs in perf/. More will be added as time
goes on. This stuff will be built on so that performance changes can be
tracked over time.
Added:
trunk/perf/
trunk/perf/Makefile.am
trunk/perf/ffbench.c
trunk/perf/ffbench.vgperf
trunk/perf/sarp.c
trunk/perf/sarp.vgperf
trunk/perf/vg_perf.in
Modified:
trunk/Makefile.am
trunk/configure.in
Modified: trunk/Makefile.am
===================================================================
--- trunk/Makefile.am 2005-12-09 21:01:46 UTC (rev 5322)
+++ trunk/Makefile.am 2005-12-10 23:11:28 UTC (rev 5323)
@@ -16,7 +16,7 @@
# And we want to include Addrcheck in the distro, but not compile/test it.
# Put docs last because building the HTML is slow and we want to get
# everything else working before we try it.
-SUBDIRS = include coregrind . tests auxprogs $(TOOLS) helgrind docs
+SUBDIRS = include coregrind . tests perf auxprogs $(TOOLS) helgrind docs
DIST_SUBDIRS = $(SUBDIRS) addrcheck

SUPP_FILES = \
@@ -58,6 +58,10 @@
regtest: check
@PERL@ tests/vg_regtest $(TOOLS)

+## Preprend @PERL@ because tests/vg_per isn't executable
+perf: check
+ @PERL@ perf/vg_perf perf
+
EXTRA_DIST = \
ACKNOWLEDGEMENTS \
README_DEVELOPERS \
Modified: trunk/configure.in
===================================================================
--- trunk/configure.in 2005-12-09 21:01:46 UTC (rev 5322)
+++ trunk/configure.in 2005-12-10 23:11:28 UTC (rev 5323)
@@ -496,6 +496,8 @@
docs/xml/Makefile
tests/Makefile
tests/vg_regtest
+ perf/Makefile
+ perf/vg_perf
include/Makefile
auxprogs/Makefile
coregrind/Makefile
Added: trunk/perf/Makefile.am
===================================================================
--- trunk/perf/Makefile.am (rev 0)
+++ trunk/perf/Makefile.am 2005-12-10 23:11:28 UTC (rev 5323)
@@ -0,0 +1,17 @@
+
+noinst_SCRIPTS = vg_perf
+
+EXTRA_DIST = $(noinst_SCRIPTS) \
+ ffbench.vgperf \
+ sarp.vgperf
+
+check_PROGRAMS = \
+ ffbench sarp
+
+AM_CFLAGS = $(WERROR) -Winline -Wall -Wshadow -g -O
+AM_CPPFLAGS = -I$(top_srcdir) -I$(top_srcdir)/include -I$(top_builddir)/include
+AM_CXXFLAGS = $(AM_CFLAGS)
+
+# Extra stuff
+ffbench_LDADD = -lm
+
Added: trunk/perf/ffbench.c
===================================================================
--- trunk/perf/ffbench.c (rev 0)
+++ trunk/perf/ffbench.c 2005-12-10 23:11:28 UTC (rev 5323)
@@ -0,0 +1,382 @@
+// This small program computes a Fast Fourier Transform. It tests
+// Valgrind's handling of FP operations. It is representative of all
+// programs that do a lot of FP operations.
+
+// This program was taken from http://www.fourmilab.ch/. The front page of
+// that site says:
+//
+// "Except for a few clearly-marked exceptions, all the material on this
+// site is in the public domain and may be used in any manner without
+// permission, restriction, attribution, or compensation."
+
+/*
+
+ Two-dimensional FFT benchmark
+
+ Designed and implemented by John Walker in April of 1989.
+
+ This benchmark executes a specified number of passes (default
+ 20) through a loop in which each iteration performs a fast
+ Fourier transform of a square matrix (default size 256x256) of
+ complex numbers (default precision double), followed by the
+ inverse transform. After all loop iterations are performed
+ the results are checked against known correct values.
+
+ This benchmark is intended for use on C implementations which
+ define "int" as 32 bits or longer and permit allocation and
+ direct addressing of arrays larger than one megabyte.
+
+ If CAPOUT is defined, the result after all iterations is
+ written as a CA Lab pattern file. This is intended for
+ debugging in case horribly wrong results are obtained on a
+ given machine.
+
+ Archival timings are run with the definitions below set as
+ follows: Float = double, Asize = 256, Passes = 20, CAPOUT not
+ defined.
+
+ Time (seconds) System
+
+ 2393.93 Sun 3/260, SunOS 3.4, C, "-f68881 -O".
+ (John Walker).
+
+ 1928 Macintosh IIx, MPW C 3.0, "-mc68020
+ -mc68881 -elems881 -m". (Hugh Hoover).
+
+ 1636.1 Sun 4/110, "cc -O3 -lm". (Michael McClary).
+ The suspicion is that this is software
+ floating point.
+
+ 1556.7 Macintosh II, A/UX, "cc -O -lm"
+ (Michael McClary).
+
+ 1388.8 Sun 386i/250, SunOS 4.0.1 C
+ "-O /usr/lib/trig.il". (James Carrington).
+
+ 1331.93 Sun 3/60, SunOS 4.0.1, C,
+ "-O4 -f68881 /usr/lib/libm.il"
+ (Bob Elman).
+
+ 1204.0 Apollo Domain DN4000, C, "-cpu 3000 -opt 4".
+ (Sam Crupi).
+
+ 1174.66 Compaq 386/25, SCO Xenix 386 C.
+ (Peter Shieh).
+
+ 1068 Compaq 386/25, SCO Xenix 386,
+ Metaware High C. (Robert Wenig).
+
+ 1064.0 Sun 3/80, SunOS 4.0.3 Beta C
+ "-O3 -f68881 /usr/lib/libm.il". (James Carrington).
+
+ 1061.4 Compaq 386/25, SCO Xenix, High C 1.4.
+ (James Carrington).
+
+ 1059.79 Compaq 386/25, 387/25, High C 1.4,
+ DOS|Extender 2.2, 387 inline code
+ generation. (Nathan Bender).
+
+ 777.14 Compaq 386/25, IIT 3C87-25 (387 Compatible),
+ High C 1.5, DOS|Extender 2.2, 387 inline
+ code generation. (Nathan Bender).
+
+ 751 Compaq DeskPro 386/33, High C 1.5 + DOS|Extender,
+ 387 code generation. (James Carrington).
+
+ 431.44 Compaq 386/25, Weitek 3167-25, DOS 3.31,
+ High C 1.4, DOS|Extender, Weitek code generation.
+ (Nathan Bender).
+
+ 344.9 Compaq 486/25, Metaware High C 1.6, Phar Lap
+ DOS|Extender, in-line floating point. (Nathan
+ Bender).
+
+ 324.2 Data General Motorola 88000, 16 Mhz, Gnu C.
+
+ 323.1 Sun 4/280, C, "-O4". (Eric Hill).
+
+ 254 Compaq SystemPro 486/33, High C 1.5 + DOS|Extender,
+ 387 code generation. (James Carrington).
+
+ 242.8 Silicon Graphics Personal IRIS, MIPS R2000A,
+ 12.5 Mhz, "-O3" (highest level optimisation).
+ (Mike Zentner).
+
+ 233.0 Sun SPARCStation 1, C, "-O4", SunOS 4.0.3.
+ (Nathan Bender).
+
+ 187.30 DEC PMAX 3100, MIPS 2000 chip.
+ (Robert Wenig).
+
+ 120.46 Sun SparcStation 2, C, "-O4", SunOS 4.1.1.
+ (John Walker).
+
+ 120.21 DEC 3MAX, MIPS 3000, "-O4".
+
+ 98.0 Intel i860 experimental environment,
+ OS/2, data caching disabled. (Kern
+ Sibbald).
+
+ 34.9 Silicon Graphics Indigo², MIPS R4400,
+ 175 Mhz, IRIX 5.2, "-O".
+
+ 32.4 Pentium 133, Windows NT, Microsoft Visual
+ C++ 4.0.
+
+ 17.25 Silicon Graphics Indigo², MIPS R4400,
+ 175 Mhz, IRIX 6.5, "-O3".
+
+ 14.10 Dell Dimension XPS R100, Pentium II 400 MHz,
+ Windows 98, Microsoft Visual C 5.0.
+
+ 10.7 Hewlett-Packard Kayak XU 450Mhz Pentium II,
+ Microsoft Visual C++ 6.0, Windows NT 4.0sp3. (Nathan Bender).
+
+ 5.09 Sun Ultra 2, UltraSPARC V9, 300 MHz, gcc -O3.
+
+ 0.846 Dell Inspiron 9100, Pentium 4, 3.4 GHz, gcc -O3.
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+#include <string.h>
+
+/* The program may be run with Float defined as either float or
+ double. With IEEE arithmetic, the same answers are generated for
+ either floating point mode. */
+
+#define Float double /* Floating point type used in FFT */
+
+#define Asize 256 /* Array edge size */
+#define Passes 20 /* Number of FFT/Inverse passes */
+
+#define max(a,b) ((a)>(b)?(a):(b))
+#define min(a,b) ((a)<=(b)?(a):(b))
+
+#ifndef unix
+#ifndef WIN32
+extern char *farmalloc(long s);
+#define malloc(x) farmalloc(x)
+#endif
+#define FWMODE "wb"
+#else
+#define FWMODE "w"
+#endif
+
+/*
+
+ Multi-dimensional fast Fourier transform
+
+ Adapted from Press et al., "Numerical Recipes in C".
+
+*/
+
+#define SWAP(a,b) tempr=(a); (a)=(b); (b)=tempr
+
+static void fourn(data, nn, ndim, isign)
+ Float data[];
+ int nn[], ndim, isign;
+{
+ register int i1, i2, i3;
+ int i2rev, i3rev, ip1, ip2, ip3, ifp1, ifp2;
+ int ibit, idim, k1, k2, n, nprev, nrem, ntot;
+ Float tempi, tempr;
+ double theta, wi, wpi, wpr, wr, wtemp;
+
+ ntot = 1;
+ for (idim = 1; idim <= ndim; idim++)
+ ntot *= nn[idim];
+ nprev = 1;
+ for (idim = ndim; idim >= 1; idim--) {
+ n = nn[idim];
+ nrem = ntot / (n * nprev);
+ ip1 = nprev << 1;
+ ip2 = ip1 * n;
+ ip3 = ip2 * nrem;
+ i2rev = 1;
+ for (i2 = 1; i2 <= ip2; i2 += ip1) {
+ if (i2 < i2rev) {
+ for (i1 = i2; i1 <= i2 + ip1 - 2; i1 += 2) {
+ for (i3 = i1; i3 <= ip3; i3 += ip2) {
+ i3rev = i2rev + i3 - i2;
+ SWAP(data[i3], data[i3rev]);
+ SWAP(data[i3 + 1], data[i3rev + 1]);
+ }
+ }
+ }
+ ibit = ip2 >> 1;
+ while (ibit >= ip1 && i2rev > ibit) {
+ i2rev -= ibit;
+ ibit >>= 1;
+ }
+ i2rev += ibit;
+ }
+ ifp1 = ip1;
+ while (ifp1 < ip2) {
+ ifp2 = ifp1 << 1;
+ theta = isign * 6.28318530717959 / (ifp2 / ip1);
+ wtemp = sin(0.5 * theta);
+ wpr = -2.0 * wtemp * wtemp;
+ wpi = sin(theta);
+ wr = 1.0;
+ wi = 0.0;
+ for (i3 = 1; i3 <= ifp1; i3 += ip1) {
+ for (i1 = i3; i1 <= i3 + ip1 - 2; i1 += 2) {
+ for (i2 = i1; i2 <= ip3; i2 += ifp2) {
+ k1 = i2;
+ k2 = k1 + ifp1;
+ tempr = wr * data[k2] - wi * data[k2 + 1];
+ tempi = wr * data[k2 + 1] + wi * data[k2];
+ data[k2] = data[k1] - tempr;
+ data[k2 + 1] = data[k1 + 1] - tempi;
+ data[k1] += tempr;
+ data[k1 + 1] += tempi;
+ }
+ }
+ wr = (wtemp = wr) * wpr - wi * wpi + wr;
+ wi = wi * wpr + wtemp * wpi + wi;
+ }
+ ifp1 = ifp2;
+ }
+ nprev *= n;
+ }
+}
+#undef SWAP
+
+int main()
+{
+ int i, j, k, l, m, npasses = Passes, faedge;
+ Float *fdata /* , *fd */ ;
+ static int nsize[] = {0, 0, 0};
+ long fanum, fasize;
+ double mapbase, mapscale, /* x, */ rmin, rmax, imin, imax;
+
+ faedge = Asize; /* FFT array edge size */
+ fanum = faedge * faedge; /* Elements in FFT array */
+ fasize = ((fanum + 1) * 2 * sizeof(Float)); /* FFT array size */
+ nsize[1] = nsize[2] = faedge;
+
+ fdata = (Float *) malloc(fasize);
+ if (fdata == NULL) {
+ fprintf(stdout, "Can't allocate data array.\n");
+ exit(1);
+ }
+
+ /* Generate data array to process. */
+
+#define Re(x,y) fdata[1 + (faedge * (x) + (y)) * 2]
+#define Im(x,y) fdata[2 + (faedge * (x) + (y)) * 2]
+
+ memset(fdata, 0, fasize);
+ for (i = 0; i < faedge; i++) {
+ for (j = 0; j < faedge; j++) {
+ if (((i & 15) == 8) || ((j & 15) == 8))
+ Re(i, j) = 128.0;
+ }
+ }
+
+ for (i = 0; i < npasses; i++) {
+/*printf("Pass %d\n", i);*/
+ /* Transform image to frequency domain. */
+ fourn(fdata, nsize, 2, 1);
+
+ /* Back-transform to image. */
+ fourn(fdata, nsize, 2, -1);
+ }
+
+ {
+ double r, ij, ar, ai;
+ rmin = 1e10; rmax = -1e10;
+ imin = 1e10; imax = -1e10;
+ ar = 0;
+ ai = 0;
+
+ for (i = 1; i <= fanum; i += 2) {
+ r = fdata[i];
+ ij = fdata[i + 1];
+ ar += r;
+ ai += ij;
+ rmin = min(r, rmin);
+ rmax = max(r, rmax);
+ imin = min(ij, imin);
+ imax = max(ij, imax);
+ }
+#ifdef DEBUG
+ printf("Real min %.4g, max %.4g. Imaginary min %.4g, max %.4g.\n",
+ rmin, rmax, imin, imax);
+ printf("Average real %.4g, imaginary %.4g.\n",
+ ar / fanum, ai / fanum);
+#endif
+ mapbase = rmin;
+ mapscale = 255 / (rmax - rmin);
+ }
+
+ /* See if we got the right answers. */
+
+ m = 0;
+ for (i = 0; i < faedge; i++) {
+ for (j = 0; j < faedge; j++) {
+ k = (Re(i, j) - mapbase) * mapscale;
+ l = (((i & 15) == 8) || ((j & 15) == 8)) ? 255 : 0;
+ if (k != l) {
+ m++;
+ fprintf(stdout,
+ "Wrong answer at (%d,%d)! Expected %d, got %d.\n",
+ i, j, l, k);
+ }
+ }
+ }
+ if (m == 0) {
+ fprintf(stdout, "%d passes. No errors in results.\n", npasses);
+ } else {
+ fprintf(stdout, "%d passes. %d errors in results.\n",
+ npasses, m);
+ }
+
+#ifdef CAPOUT
+
+ /* Output the result of the transform as a CA Lab pattern
+ file for debugging. */
+
+ {
+#define SCRX 322
+#define SCRY 200
+#define SCRN (SCRX * SCRY)
+ unsigned char patarr[SCRY][SCRX];
+ FILE *fp;
+
+/* Map user external state numbers to internal state index */
+
+#define UtoI(x) (((((x) >> 1) & 0x7F) | ((x) << 7)) & 0xFF)
+
+ /* Copy data from FFT buffer to map. */
+
+ memset(patarr, 0, sizeof patarr);
+ l = (SCRX - faedge) / 2;
+ m = (faedge > SCRY) ? 0 : ((SCRY - faedge) / 2);
+ for (i = 1; i < faedge; i++) {
+ for (j = 0; j < min(SCRY, faedge); j++) {
+ k = (Re(i, j) - mapbase) * mapscale;
+ patarr[j + m][i + l] = UtoI(k);
+ }
+ }
+
+ /* Dump pattern map to file. */
+
+ fp = fopen("fft.cap", "w");
+ if (fp == NULL) {
+ fprintf(stdout, "Cannot open output file.\n");
+ exit(0);
+ }
+ putc(':', fp);
+ putc(1, fp);
+ fwrite(patarr, SCRN, 1, fp);
+ putc(6, fp);
+ fclose(fp);
+ }
+#endif
+
+ return 0;
+}
Added: trunk/perf/ffbench.vgperf
===================================================================
--- trunk/perf/ffbench.vgperf (rev 0)
+++ trunk/perf/ffbench.vgperf 2005-12-10 23:11:28 UTC (rev 5323)
@@ -0,0 +1,2 @@
+prog: ffbench
+tools: none memcheck
Added: trunk/perf/sarp.c
===================================================================
--- trunk/perf/sarp.c (rev 0)
+++ trunk/perf/sarp.c 2005-12-10 23:11:28 UTC (rev 5323)
@@ -0,0 +1,46 @@
+// This artificial program allocates and deallocates a lot of large objects
+// on the stack. It is a stress test for Memcheck's set_address_range_perms
+// (sarp) function. Pretty much all Valgrind versions up to 3.1.X do very
+// badly on it, ie. a slowdown of at least 100x.
+//
+// It is representative of tsim_arch, the simulator for the University of
+// Texas's TRIPS processor, whose performance under Valgrind is dominated by
+// the handling of one frequently-called function that allocates 8348 bytes
+// on the stack.
+
+#include <assert.h>
+#include <time.h>
+
+#define REPS 1000*1000
+
+int f(int i)
+{
+ // This nonsense is just to ensure that the compiler does not optimise
+ // away the stack allocation.
+ char big_array[8348];
+ big_array[0] = 12;
+ big_array[2333] = 34;
+ big_array[5678] = 56;
+ big_array[8347] = 78;
+ assert( 8000 == (&big_array[8100] - &big_array[100]) );
+ return big_array[i];
+}
+
+int main(void)
+{
+ int i, sum = 0;
+
+ struct timespec req;
+ req.tv_sec = 0;
+ req.tv_nsec = 100*1000*1000; // 0.1s
+
+ // Pause for a bit so that the native run-time is not 0.00, which leads
+ // to ridiculous slow-down figures.
+ nanosleep(&req, NULL);
+
+ for (i = 0; i < REPS; i++) {
+ sum += f(i & 0xff);
+ }
+ return sum % 256;
+}
+
Added: trunk/perf/sarp.vgperf
===================================================================
--- trunk/perf/sarp.vgperf (rev 0)
+++ trunk/perf/sarp.vgperf 2005-12-10 23:11:28 UTC (rev 5323)
@@ -0,0 +1,2 @@
+prog: sarp
+tools: none memcheck
Added: trunk/perf/vg_perf.in
===================================================================
--- trunk/perf/vg_perf.in (rev 0)
+++ trunk/perf/vg_perf.in 2005-12-10 23:11:28 UTC (rev 5323)
@@ -0,0 +1,368 @@
+#! @PERL@
+##--------------------------------------------------------------------##
+##--- Valgrind performance testing script vg_perf ---##
+##--------------------------------------------------------------------##
+
+# This file is part of Valgrind, a dynamic binary instrumentation
+# framework.
+#
+# Copyright (C) 2005 Nicholas Nethercote
+# nj...@va...
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation; either version 2 of the
+# License, or (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
+# 02111-1307, USA.
+#
+# The GNU General Public License is contained in the file COPYING.
+
+#----------------------------------------------------------------------------
+# usage: vg_perf [options] <dirs | files>
+#
+# Options:
+# --all: run tests in all subdirs
+# --valgrind: valgrind to use (the directory it's in). Default is the one
+# in the current tree.
+#
+# The easiest way is to run all tests in valgrind/ with (assuming you installed
+# in $PREFIX):
+#
+# perl perf/vg_perf --all
+#
+# You can specify individual files to test, or whole directories, or both.
+# Directories are traversed recursively, except for ones named, for example,
+# CVS/ or docs/.
+#
+# Each test is defined in a file <test>.vgperf, containing one or more of the
+# following lines, in any order:
+# - prog: <prog to run> (compulsory)
+# - tools: <Valgrind tools> (compulsory)
+# - args: <args for prog> (default: none)
+# - vgopts: <Valgrind options> (default: none)
+# - prereq: <prerequisite command> (default: none)
+# - cleanup: <post-test cleanup cmd to run> (default: none)
+#
+# The prerequisite command, if present, must return 0 otherwise the test is
+# skipped.
+#----------------------------------------------------------------------------
+
+use warnings;
+use strict;
+
+#----------------------------------------------------------------------------
+# Global vars
+#----------------------------------------------------------------------------
+my $usage="vg_perf [--all, --valgrind]\n";
+
+my $tmp="vg_perf.tmp.$$";
+
+# Test variables
+my $vgopts; # valgrind options
+my $prog; # test prog
+my $args; # test prog args
+my $prereq; # prerequisite test to satisfy before running test
+my $cleanup; # cleanup command to run
+my @tools; # which tools are we measuring the program with
+
+# Abbreviations used in output
+my %toolnames = (
+ none => "nl",
+ memcheck => "mc",
+ cachegrind => "cg",
+ massif => "ms"
+);
+
+# We run each program this many times and choose the best time.
+my $n_runs = 3;
+
+my $num_tests_done = 0;
+my $num_timings_done = 0;
+
+# Starting directory
+chomp(my $tests_dir = `pwd`);
+
+# Directory of the Valgrind being measured. Default is the one in the
+# current tree.
+my $vg_dir = $tests_dir;
+
+#----------------------------------------------------------------------------
+# Process command line, setup
+#----------------------------------------------------------------------------
+
+# If $prog is a relative path, it prepends $dir to it. Useful for two reasons:
+#
+# 1. Can prepend "." onto programs to avoid trouble with users who don't have
+# "." in their path (by making $dir = ".")
+# 2. Can prepend the current dir to make the command absolute to avoid
+# subsequent trouble when we change directories.
+#
+# Also checks the program exists and is executable.
+sub validate_program ($$$$)
+{
+ my ($dir, $prog, $must_exist, $must_be_executable) = @_;
+
+ # If absolute path, leave it alone. If relative, make it
+ # absolute -- by prepending current dir -- so we can change
+ # dirs and still use it.
+ $prog = "$dir/$prog" if ($prog !~ /^\//);
+ if ($must_exist) {
+ (-f $prog) or die "vg_perf: '$prog' not found or not a file ($dir)\n";
+ }
+ if ($must_be_executable) {
+ (-x $prog) or die "vg_perf: '$prog' not executable ($dir)\n";
+ }
+
+ return $prog;
+}
+
+sub validate_tools($)
+{
+ # XXX: should check they exist!
+ my ($toolnames) = @_;
+ my @t = split(/\s+/, $toolnames);
+ return @t;
+}
+
+sub process_command_line()
+{
+ my $alldirs = 0;
+ my @fs;
+
+ for my $arg (@ARGV) {
+ if ($arg =~ /^-/) {
+ if ($arg =~ /^--all$/) {
+ $alldirs = 1;
+ } elsif ($arg =~ /^--valgrind=(.*)$/) {
+ $vg_dir = $1;
+ } else {
+ die $usage;
+ }
+ } else {
+ push(@fs, $arg);
+ }
+ }
+ # Make $vg_dir absolute if not already
+ if ($vg_dir !~ /^\//) { $vg_dir = "$tests_dir/$vg_dir"; }
+ validate_program($vg_dir, "./coregrind/valgrind", 1, 1);
+
+ if ($alldirs) {
+ @fs = ();
+ foreach my $f (glob "*") {
+ push(@fs, $f) if (-d $f);
+ }
+ }
+
+ (0 != @fs) or die "No test files or directories specified\n";
+
+ return @fs;
+}
+
+#----------------------------------------------------------------------------
+# Read a .vgperf file
+#----------------------------------------------------------------------------
+sub read_vgperf_file($)
+{
+ my ($f) = @_;
+
+ # Defaults.
+ ($vgopts, $prog, $args, $prereq, $cleanup)
+ = ("", undef, "", undef, undef, undef, undef);
+
+ open(INPUTFILE, "< $f") || die "File $f not openable\n";
+
+ while (my $line = <INPUTFILE>) {
+ if ($line =~ /^\s*#/ || $line =~ /^\s*$/) {
+ next;
+ } elsif ($line =~ /^\s*vgopts:\s*(.*)$/) {
+ $vgopts = $1;
+ } elsif ($line =~ /^\s*prog:\s*(.*)$/) {
+ $prog = validate_program(".", $1, 0, 0);
+ } elsif ($line =~ /^\s*tools:\s*(.*)$/) {
+ @tools = validate_tools($1);
+ } elsif ($line =~ /^\s*args:\s*(.*)$/) {
+ $args = $1;
+ } elsif ($line =~ /^\s*prereq:\s*(.*)$/) {
+ $prereq = $1;
+ } elsif ($line =~ /^\s*cleanup:\s*(.*)$/) {
+ $cleanup = $1;
+ } else {
+ die "Bad line in $f: $line\n";
+ }
+ }
+ close(INPUTFILE);
+
+ if (!defined $prog) {
+ $prog = ""; # allow no prog for testing error and --help cases
+ }
+ if (0 == @tools) {
+ die "vg_perf: missing 'tools' line in $f\n";
+ }
+}
+
+#----------------------------------------------------------------------------
+# Do one test
+#----------------------------------------------------------------------------
+# Since most of the program time is spent in system() calls, need this to
+# propagate a Ctrl-C enabling us to quit.
+sub mysystem($)
+{
+ (system($_[0]) != 2) or exit 1; # 2 is SIGINT
+}
+
+# Run program N times, return the best wall-clock time.
+sub time_prog($$)
+{
+ my ($cmd, $n) = @_;
+ my $tmin = 999999;
+ for (my $i = 0; $i < $n; $i++) {
+ my $out = `$cmd 2>&1 1>/dev/null`;
+ $out =~ /walltime: ([\d\.]+)s/;
+ $tmin = $1 if ($1 < $tmin);
+ }
+ return $tmin;
+}
+
+sub do_one_test($$)
+{
+ my ($dir, $vgperf) = @_;
+ $vgperf =~ /^(.*)\.vgperf/;
+ my $name = $1;
+
+ read_vgperf_file($vgperf);
+
+ if (defined $prereq) {
+ if (system("$prereq") != 0) {
+ printf("%-16s (skipping, prereq failed: $prereq)\n", "$name:");
+ return;
+ }
+ }
+
+ printf("%-12s", "$name:");
+
+ my $timecmd = "/usr/bin/time -f 'walltime: %es'";
+
+ # Do the native run(s).
+ printf("nt:");
+ my $cmd = "$timecmd $prog $args";
+ my $tNative = time_prog($cmd, $n_runs);
+ printf("%4.1fs ", $tNative);
+
+ foreach my $tool (@tools) {
+ (defined $toolnames{$tool}) or
+ die "unknown tool $tool, please add to %toolnames\n";
+
+ # Do the tool run(s). Set both VALGRIND_LIB and VALGRIND_LIB_INNER
+ # in case this Valgrind was configured with --enable-inner.
+ printf("%s:", $toolnames{$tool});
+ my $vgsetup = "VALGRIND_LIB=$vg_dir/.in_place "
+ . "VALGRIND_LIB_INNER=$vg_dir/.in_place ";
+ my $vgcmd = "$vg_dir/coregrind/valgrind "
+ . "--command-line-only=yes --tool=$tool -q "
+ . "--memcheck:leak-check=no --addrcheck:leak-check=no "
+ . "$vgopts ";
+ my $cmd = "$vgsetup $timecmd $vgcmd $prog $args";
+ my $tTool = time_prog($cmd, $n_runs);
+ printf("%4.1fs (%4.1fx) ", $tTool, $tTool/$tNative);
+
+ $num_timings_done++;
+ }
+ printf("\n");
+
+ if (defined $cleanup) {
+ (system("$cleanup") == 0) or
+ print(" ($name cleanup operation failed: $cleanup)\n");
+ }
+
+ $num_tests_done++;
+}
+
+#----------------------------------------------------------------------------
+# Test one directory (and any subdirs)
+#----------------------------------------------------------------------------
+sub test_one_dir($$); # forward declaration
+
+sub test_one_dir($$)
+{
+ my ($dir, $prev_dirs) = @_;
+ $dir =~ s/\/$//; # trim a trailing '/'
+
+ # Ignore dirs into which we should not recurse.
+ if ($dir =~ /^(BitKeeper|CVS|SCCS|docs|doc)$/) { return; }
+
+ chdir($dir) or die "Could not change into $dir\n";
+
+ # Nb: Don't prepend a '/' to the base directory
+ my $full_dir = $prev_dirs . ($prev_dirs eq "" ? "" : "/") . $dir;
+ my $dashes = "-" x (50 - length $full_dir);
+
+ my @fs = glob "*";
+ my $found_tests = (0 != (grep { $_ =~ /\.vgperf$/ } @fs));
+
+ if ($found_tests) {
+ print "-- Running tests in $full_dir $dashes\n";
+ }
+ foreach my $f (@fs) {
+ if (-d $f) {
+ test_one_dir($f, $full_dir);
+ } elsif ($f =~ /\.vgperf$/) {
+ do_one_test($full_dir, $f);
+ }
+ }
+ if ($found_tests) {
+ print "-- Finished tests in $full_dir $dashes\n";
+ }
+
+ chdir("..");
+}
+
+#----------------------------------------------------------------------------
+# Summarise results
+#----------------------------------------------------------------------------
+sub summarise_results
+{
+ printf("\n== %d programs, %d timings =================\n\n",
+ $num_tests_done, $num_timings_done);
+}
+
+#----------------------------------------------------------------------------
+# main()
+#----------------------------------------------------------------------------
+
+# nuke VALGRIND_OPTS
+$ENV{"VALGRIND_OPTS"} = "";
+
+my @fs = process_command_line();
+foreach my $f (@fs) {
+ if (-d $f) {
+ test_one_dir($f, "");
+ } else {
+ # Allow the .vgperf suffix to be given or omitted
+ if ($f =~ /.vgperf$/ && -r $f) {
+ # do nothing
+ } elsif (-r "$f.vgperf") {
+ $f = "$f.vgperf";
+ } else {
+ die "`$f' neither a directory nor a readable test file/name\n"
+ }
+ my $dir = `dirname $f`; chomp $dir;
+ my $file = `basename $f`; chomp $file;
+ chdir($dir) or die "Could not change into $dir\n";
+ do_one_test($dir, $file);
+ chdir($tests_dir);
+ }
+}
+summarise_results();
+
+##--------------------------------------------------------------------##
+##--- end ---##
+##--------------------------------------------------------------------##
|
|
From: Julian S. <js...@ac...> - 2005-12-12 02:17:13
|
> First attempt at some performance tracking tools.

This is great stuff. Here are some prelim numbers. I'm disregarding
sarp for the time being (will get to that). Hence, for ffbench:

P4 Northwood 1.7GHz         nt: 4.7s  nl:11.3s ( 2.4x)  mc:25.0s ( 5.3x)
P3 Tualatin 1.13GHz         nt: 6.3s  nl:11.4s ( 1.8x)  mc:30.2s ( 4.8x)
MPC7447A 1.25Ghz (ppc G4)   nt: 5.4s  nl: 8.2s ( 1.5x)  mc:25.8s ( 4.8x)

ffbench is atypically favourable for V. The inner loop consists of
one very long basic block, which vex's IR optimisation does well on,
and the expensive fixed cost of jumping between bbs is pretty small.

What are we to make from this? First off, it's nice to see that the
ppc compilation pipeline produces code quality at least as good as
x86, if not better. Perhaps ppc is a bit of an easier target; the
condition code stuff is not quite as difficult to simulate as on x86,
and it doesn't have the FP register stack idiocy to contend with.

Interesting that P4 falls relatively far behind here with 'none' (nl).
Given that the P3 is running an identical Linux distro and Valgrind
setup, the performance differences must be microarchitectural, and,
I'm betting, center around the P4's worse behaviour on branch
mispredicts.

Curious to see though that P4 makes up ground with memcheck
(2.4x -> 5.3x) as compared to the 7447's showing (1.5x -> 4.8x).
Perhaps the P4's aggressive out-of-orderness chews through the
memcheck instrumentation and helper calls better than the 7447's
relatively modest superscalar implementation.

Here are the numbers for sarp (which is a bad case for memcheck):

P4 Northwood 1.7GHz         nt: 0.1s  nl: 0.5s ( 4.8x)  mc:20.3s (184.8x)
P3 Tualatin 1.13GHz         nt: 0.1s  nl: 0.6s ( 5.3x)  mc:29.3s (266.0x)
MPC7447A 1.25Ghz (ppc G4)   nt: 0.1s  nl: 0.6s ( 5.2x)  mc:22.0s (199.7x)

In this case, I'm wary of trusting these ratios much given that the run
time of the native case is small enough (<= 0.1s) that measurement noise
could be significant.

How about the following suggestion: all programs in the performance
suite take a single command line arg, an integer, which controls how many
iterations of the basic work-unit are to be done. The perl script
starts off feeding it '1', then increasing it (exponentially) until the
native run time exceeds some minimum value for reliable timing, say 1
second. Doing this would get us reliable numbers on very fast machines
without making it run inordinately long on slower machines.

J
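The calibration loop suggested above could be sketched roughly as follows. This is only an illustration in C, not code from vg_perf: `timed_run_fn`, `calibrate_iterations` and `linear_cost_model` are hypothetical names, the callback stands in for "run the program natively with this iteration count and report the wall-clock seconds", and the `max_iters` cap is an added assumption so the loop always terminates.

```c
#include <stddef.h>

/* Hypothetical callback: performs one native run with `iters` work-units
   and returns the wall-clock seconds it took. */
typedef double (*timed_run_fn)(long iters, void *ctx);

/* Start at 1 iteration and double the count until a native run takes at
   least min_seconds, or until doubling again would exceed max_iters.
   Returns the iteration count to use for the real measurements. */
long calibrate_iterations(timed_run_fn run, void *ctx,
                          double min_seconds, long max_iters)
{
    long iters = 1;
    while (run(iters, ctx) < min_seconds && iters <= max_iters / 2) {
        iters *= 2;
    }
    return iters;
}

/* Stand-in for a real run, for trying the loop out: pretends each
   work-unit costs *(double *)ctx seconds. */
double linear_cost_model(long iters, void *ctx)
{
    return (double)iters * *(const double *)ctx;
}
```

With a pretend cost of 0.01s per work-unit and a 1 second minimum, the loop tries 1, 2, 4, ... and settles on 128 iterations (1.28s). The same count would then be passed to every tool run of that program, so the slowdown ratios stay comparable.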
|
From: Dirk M. <dm...@gm...> - 2005-12-12 08:36:36
|
On Monday 12 December 2005 03:16, Julian Seward wrote:

> second. Doing this would get us reliable numbers on very fast machines
> without making it run inordinately long on slower machines.

It would also be useful to "pre-heat" the CPU, otherwise measurements are
basically useless on speedstepping processor architectures (it takes
0.1-0.3s of CPU load until you can trust that the CPU actually runs at
full speed).

Dirk
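A minimal sketch of such a pre-heat step, assuming a plain busy loop is enough to make the CPU ramp up its clock. `warm_up_cpu` is a hypothetical helper, not something in the committed script, and it measures CPU time with the standard `clock()` function, which is assumed to be available on the platform.

```c
#include <time.h>

/* Spin until roughly `seconds` of CPU time have been consumed, so that a
   frequency-scaling CPU has a chance to reach full speed before the timed
   run starts.  Returns the CPU seconds actually spent. */
double warm_up_cpu(double seconds)
{
    volatile unsigned long sink = 0;   /* volatile: keep the loop from being optimised away */
    clock_t start = clock();
    double elapsed = 0.0;

    while (elapsed < seconds) {
        for (int i = 0; i < 100000; i++) {
            sink += (unsigned long)i;
        }
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return elapsed;
}
```

Something like `warm_up_cpu(0.3)` would have to run immediately before each timed run; whether the clock speed stays up across the start of a new process is an open question that depends on the machine.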
|
From: Nicholas N. <nj...@cs...> - 2005-12-12 16:27:43
|
On Mon, 12 Dec 2005, Julian Seward wrote:

> Hence, for ffbench:
>
> P4 Northwood 1.7GHz         nt: 4.7s  nl:11.3s ( 2.4x)  mc:25.0s ( 5.3x)
> P3 Tualatin 1.13GHz         nt: 6.3s  nl:11.4s ( 1.8x)  mc:30.2s ( 4.8x)
> MPC7447A 1.25Ghz (ppc G4)   nt: 5.4s  nl: 8.2s ( 1.5x)  mc:25.8s ( 4.8x)
>
> Here are the numbers for sarp (which is a bad case for memcheck):
>
> P4 Northwood 1.7GHz         nt: 0.1s  nl: 0.5s ( 4.8x)  mc:20.3s (184.8x)
> P3 Tualatin 1.13GHz         nt: 0.1s  nl: 0.6s ( 5.3x)  mc:29.3s (266.0x)
> MPC7447A 1.25Ghz (ppc G4)   nt: 0.1s  nl: 0.6s ( 5.2x)  mc:22.0s (199.7x)

Here are my numbers on a dual P4 3.0 GHz:

ffbench:  nt: 0.8s  nl: 4.2s ( 5.0x)  mc:10.7s (12.7x)
sarp:     nt: 0.1s  nl: 0.3s ( 2.9x)  mc:13.7s (124.2x)

Much worse than yours. I'm not sure what kind of P4 it is; /proc/cpuinfo
says (this info repeated twice, one per CPU):

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping        : 4
cpu MHz         : 2992.664
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 3
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni monitor
ds_cpl cid
bogomips        : 5976.88

If you're right about the branch prediction, perhaps this machine has a
longer pipeline and so mispredicts are hitting harder?

The site that has the ffbench program has another one, fbench, which
does some different FP operations. I think I'll add that to the suite.

> In this case, I'm wary of trusting these ratios much given that the run
> time of the native case is small enough (<= 0.1s) that measurement noise
> could be significant.

If you look at the code I inserted a 0.1s nanosleep to mitigate this;
remove that and natively it will probably be measured as 0.00s. So the
slow-down is even worse than 100--200x.

I'm imagining that the performance suite will consist of some
small-but-real programs (eg. ffbench), and artificial programs like sarp
that test specific cases -- it shows a specific performance bug in
Memcheck, in that a simple operation (a large change in the SP) becomes
many operations (Memcheck has to set all the affected A+V bits). And this
program does much better in the COMPVBITS branch:

sarp:     nt: 0.1s  nl: 0.3s ( 2.9x)  mc: 2.9s (26.2x)

It's interesting to see that 2.4.X does very poorly on ffbench under
Memcheck (under Nulgrind it's only slightly slower than 3.1.X):

ffbench:  nt: 0.8s  nl: 4.9s ( 6.0x)  mc:40.8s (49.7x)
sarp:     nt: 0.1s  nl: 0.2s ( 2.1x)  mc:11.1s (100.6x)

> How about the following suggestion: all programs in the performance
> suite take a single command line arg, an integer, which controls how many
> iterations of the basic work-unit are to be done. The perl script
> starts off feeding it '1', then increasing it (exponentially) until the
> native run time exceeds some minimum value for reliable timing, say 1
> second. Doing this would get us reliable numbers on very fast machines
> without making it run inordinately long on slower machines.

It is a good idea to build in compensation for different processor speeds.
The details are tricky; if we mandate a 1 second minimum for native, sarp
will run for a couple of minutes, which is a pain. The minimum time could
be a parameter in the .vgperf file, perhaps.

As well as consistency across different machines, consistency on
individual machines will be important -- ie. we want to get similar
results on each run. This will be important when I get around to adding
some kind of performance-tracking infrastructure. It will take some more
programs and experience to see how to handle this.

Nick
|
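The calibration scheme Julian suggests in the message above (start at one iteration, grow exponentially until the native run is long enough to time reliably) could look roughly like the sketch below. It is an illustration in Python, not the vg_perf script, which is Perl. The names `time_run`, `calibrate`, and `make_cmd` are hypothetical, and the sketch assumes each benchmark takes its iteration count as a single command-line argument, as the suggestion proposes. The `min_seconds` parameter plays the role of the per-test minimum that Nick suggests could live in the .vgperf file.

```python
import subprocess
import time


def time_run(cmd):
    """Run cmd once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def calibrate(make_cmd, min_seconds=1.0, max_iters=1 << 20, timer=time_run):
    """Double the iteration count until the native run takes min_seconds.

    make_cmd(iters) must return the argv list for a native run with that
    many iterations. Returns (iters, elapsed_seconds) for the first run
    that is long enough, or for max_iters if no run ever is.
    """
    iters = 1
    while True:
        elapsed = timer(make_cmd(iters))
        if elapsed >= min_seconds or iters >= max_iters:
            return iters, elapsed
        iters *= 2
```

A caller would pass something like `lambda n: ["./sarp", str(n)]` (a hypothetical invocation) and then reuse the returned iteration count for the Nulgrind and Memcheck runs, so all three timings cover the same amount of work. The `timer` argument exists only so the loop can be exercised without running a real program.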
From: Josef W. <Jos...@gm...> - 2005-12-12 22:48:52
|
On Monday 12 December 2005 17:27, Nicholas Nethercote wrote:
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni monitor
> ds_cpl cid

It is a Prescott. "pni" means Prescott New Instructions, ie. SSE3.

Josef
|
|
From: Tom H. <to...@co...> - 2005-12-12 17:38:42
|
In message <200...@ac...>
Julian Seward <js...@ac...> wrote:
>
> > First attempt at some performance tracking tools.
>
> This is great stuff. Here are some prelim numbers. I'm disregarding
> sarp for the time being (will get to that). Hence, for ffbench:
>
> P4 Northwood 1.7GHz nt: 4.7s nl:11.3s ( 2.4x) mc:25.0s ( 5.3x)
> P3 Tualatin 1.13GHz nt: 6.3s nl:11.4s ( 1.8x) mc:30.2s ( 4.8x)
> MPC7447A 1.25Ghz (ppc G4) nt: 5.4s nl: 8.2s ( 1.5x) mc:25.8s ( 4.8x)
Athlon XP 2100+ (1.7Ghz) nt: 3.6s nl: 6.6s ( 1.8x) mc:16.0s ( 4.4x)
Opteron 250 (2.4Ghz) nt: 1.1s nl: 2.4s ( 2.2x) mc: 9.0s ( 8.4x)
> P4 Northwood 1.7GHz nt: 0.1s nl: 0.5s ( 4.8x) mc:20.3s (184.8x)
> P3 Tualatin 1.13GHz nt: 0.1s nl: 0.6s ( 5.3x) mc:29.3s (266.0x)
> MPC7447A 1.25Ghz (ppc G4) nt: 0.1s nl: 0.6s ( 5.2x) mc:22.0s (199.7x)
Athlon XP 2100+ (1.7Ghz) nt: 0.1s nl: 0.4s ( 4.0x) mc:15.0s (149.6x)
Opteron 250 (2.4Ghz) nt: 0.1s nl: 0.3s ( 3.2x) mc:12.8s (127.8x)
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
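The result lines posted in this thread all have the same shape, so the numbers can be cross-checked mechanically. The sketch below is an illustration only: `parse_result` and `implied_native` are hypothetical helpers, and the regular expression is inferred from the lines quoted here rather than from vg_perf's source. It rests on the assumption that the printed ratio was computed from unrounded times, which the sarp lines suggest (20.3s / 0.1s would be 203x, not 184.8x).

```python
import re

# Matches "nt: 4.7s", "nl:11.3s ( 2.4x)", "mc:20.3s (184.8x)" and similar.
RESULT_RE = re.compile(r"(nt|nl|mc):\s*([\d.]+)s(?:\s*\(\s*([\d.]+)x\))?")


def parse_result(line):
    """Return {tool: (seconds, ratio_or_None)} for one result line."""
    results = {}
    for tool, secs, ratio in RESULT_RE.findall(line):
        results[tool] = (float(secs), float(ratio) if ratio else None)
    return results


def implied_native(results, tool):
    """Estimate the native time implied by a tool's time and printed ratio."""
    secs, ratio = results[tool]
    return secs / ratio
```

Applied to the P4 Northwood sarp line, `implied_native(..., "mc")` gives about 0.11s, which is consistent with a native run dominated by the 0.1s sleep mentioned elsewhere in the thread.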
From: Julian S. <js...@ac...> - 2005-12-12 17:12:53
|
> > P4 Northwood 1.7GHz nt: 4.7s nl:11.3s ( 2.4x) mc:25.0s ( 5.3x)
> > P3 Tualatin 1.13GHz nt: 6.3s nl:11.4s ( 1.8x) mc:30.2s ( 4.8x)
> > MPC7447A 1.25Ghz (ppc G4) nt: 5.4s nl: 8.2s ( 1.5x) mc:25.8s ( 4.8x)
>
> Here are my numbers on a dual P4 3.0 GHz:
>
> ffbench: nt: 0.8s nl: 4.2s ( 5.0x) mc:10.7s (12.7x)
> sarp: nt: 0.1s nl: 0.3s ( 2.9x) mc:13.7s (124.2x)
>
> Much worse than yours. I'm not sure what kind of P4 it is; /proc/cpuinfo
> says (this info repeated twice, one per CPU):

It might be a P4 Prescott. Can you find out? There were two different P4
incarnations with significantly different uarchitectures.
Willamette/Northwood was the original one, with a ~20 stage pipe, whereas
Prescott has 31 stages. (I think Prescotts are labelled "Pentium 4 3.0 E",
where the "E" is the clue.)

> If you're right about the branch prediction, perhaps this machine has a
> longer pipeline and so mispredicts are hitting harder?

Maybe. Have got tired of listening to myself wittering on about branch
mispredicts and am in mid-experiment to try and build a
more-or-less-mispredict-free dispatcher.

> > In this case, I'm wary of trusting these ratios much given that the run
> > time of the native case is small enough (<= 0.1s) that measurement noise
> > could be significant.
>
> If you look at the code I inserted a 0.1s nanosleep to mitigate this;
> remove that and natively it will probably be measured as 0.00s. So the
> slow-down is even worse than 100--200x.

Ehm ... nanosleep causes the process to be descheduled and so won't
it have no effect on the total CPU time? You're measuring CPU and
not wallclock, right?

> It's interesting to see that 2.4.X does very poorly on ffbench under
> Memcheck (under Nulgrind it's only slightly slower than 3.1.X):
>
> ffbench: nt: 0.8s nl: 4.9s ( 6.0x) mc:40.8s (49.7x)
> sarp: nt: 0.1s nl: 0.2s ( 2.1x) mc:11.1s (100.6x)

Yes. That's due to the UCode JIT being microarchitecturally naive and
doing a lot of fxsave/fxrstor around FP insns, with catastrophic effects
on performance. That's a baseline (nl) overhead though - I'm surprised it
carries over into memcheck too. Ah well.

> It is a good idea to build in compensation for different processor speeds.
> The details are tricky; if we mandate a 1 second minimum for native, sarp
> will run for a couple of minutes, which is a pain.

As Dirk points out, we need at least a ~0.3s minimum. I don't mind if
the benchmark suite takes several minutes to complete. Anyway, once
COMPVBITS is merged, sarp only has a slowdown of 26, right?

J
|
|
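Julian's question about nanosleep turns on the difference between wall-clock time and CPU time: a sleeping process is descheduled, so the sleep adds to the former but not the latter. The sketch below shows that difference using Python's standard `time` module. It is only an illustration of the distinction, not how vg_perf measures anything, and `measure` is a hypothetical helper name.

```python
import time


def measure(fn):
    """Call fn() and return (wall_seconds, cpu_seconds) spent in it.

    Wall-clock time comes from time.perf_counter(); CPU time (user plus
    system, for this process only) comes from time.process_time().
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = measure(lambda: time.sleep(0.1))
    print(f"wall: {wall:.3f}s  cpu: {cpu:.3f}s")
```

A 0.1s sleep shows up as roughly 0.1s of wall-clock time and close to zero CPU time. Padding a benchmark with a sleep therefore only raises the native baseline if the harness measures wall-clock (real) time.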
From: Cerion Armour-B. <ce...@op...> - 2005-12-12 18:28:28
|
On Monday 12 December 2005 18:38, Tom Hughes wrote:
> In message <200...@ac...>
>
> Julian Seward <js...@ac...> wrote:
> > > First attempt at some performance tracking tools.
> >
> > Hence, for ffbench:
> >
> > P4 Northwood 1.7GHz nt: 4.7s nl:11.3s ( 2.4x) mc:25.0s ( 5.3x)
> > P3 Tualatin 1.13GHz nt: 6.3s nl:11.4s ( 1.8x) mc:30.2s ( 4.8x)
> > MPC7447A 1.25Ghz (ppc G4) nt: 5.4s nl: 8.2s ( 1.5x) mc:25.8s ( 4.8x)
>
> Athlon XP 2100+ (1.7Ghz) nt: 3.6s nl: 6.6s ( 1.8x) mc:16.0s ( 4.4x)
> Opteron 250 (2.4Ghz) nt: 1.1s nl: 2.4s ( 2.2x) mc: 9.0s ( 8.4x)

PPC970FX (2.5GHz) nt: 2.3s nl: 3.6s ( 1.6x) mc:11.6s ( 5.1x)

> > Here are the numbers for sarp (which is a bad case for memcheck):
> >
> > P4 Northwood 1.7GHz nt: 0.1s nl: 0.5s ( 4.8x) mc:20.3s (184.8x)
> > P3 Tualatin 1.13GHz nt: 0.1s nl: 0.6s ( 5.3x) mc:29.3s (266.0x)
> > MPC7447A 1.25Ghz (ppc G4) nt: 0.1s nl: 0.6s ( 5.2x) mc:22.0s (199.7x)
>
> Athlon XP 2100+ (1.7Ghz) nt: 0.1s nl: 0.4s ( 4.0x) mc:15.0s (149.6x)
> Opteron 250 (2.4Ghz) nt: 0.1s nl: 0.3s ( 3.2x) mc:12.8s (127.8x)

PPC970FX (2.5GHz) nt: 0.1s nl: 0.3s ( 3.2x) mc:10.9s (108.9x)

Cerion
|
|
From: Nicholas N. <nj...@cs...> - 2005-12-12 19:42:08
|
On Mon, 12 Dec 2005, Cerion Armour-Brown wrote:

>>> Hence, for ffbench:
>>>
>>> P4 Northwood 1.7GHz nt: 4.7s nl:11.3s ( 2.4x) mc:25.0s ( 5.3x)
>>> P3 Tualatin 1.13GHz nt: 6.3s nl:11.4s ( 1.8x) mc:30.2s ( 4.8x)
>>> MPC7447A 1.25Ghz (ppc G4) nt: 5.4s nl: 8.2s ( 1.5x) mc:25.8s ( 4.8x)
>>
>> Athlon XP 2100+ (1.7Ghz) nt: 3.6s nl: 6.6s ( 1.8x) mc:16.0s ( 4.4x)
>> Opteron 250 (2.4Ghz) nt: 1.1s nl: 2.4s ( 2.2x) mc: 9.0s ( 8.4x)
>
> PPC970FX (2.5GHz) nt: 2.3s nl: 3.6s ( 1.6x) mc:11.6s ( 5.1x)

This is great! :) I look forward to seeing more numbers as we build up the
performance suite.

Nick
|
|
From: Nicholas N. <nj...@cs...> - 2005-12-12 19:47:48
|
On Mon, 12 Dec 2005, Julian Seward wrote:

> It might be a P4 Prescott. Can you find out? There were two different P4
> incarnations with significantly different uarchitectures. Willamette/Northwood
> was the original one, with a ~20 stage pipe, whereas Prescott has 31 stages.
> (I think Prescotts are labelled "Pentium 4 3.0 E", where the "E" is the
> clue.)

How do I find out? I can't see anything relevant on the box.

> Ehm ... nanosleep causes the process to be descheduled and so won't
> it have no effect on the total CPU time? You're measuring CPU and
> not wallclock, right?

I'm measuring wall-clock (real). Should I be measuring user time?

Nick
|
|
From: Nicholas N. <nj...@cs...> - 2005-12-12 20:42:51
|
On Mon, 12 Dec 2005, Nicholas Nethercote wrote:

> How do I find out? I can't see anything relevant on the box.

Ok, according to

http://gentoo-wiki.com/Safe_Cflags#Pentium_4_.28Prescott.29_.28Intel.29

if cpu family is 15 and model is 4 it's a Prescott. So, super-long
pipeline.

Nick
|
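The checks discussed in this thread (the cpu family and model fields, and Josef's observation that the "pni" flag indicates SSE3) can be read out of /proc/cpuinfo mechanically. The sketch below is an illustration only: `parse_cpuinfo` and `has_sse3` are hypothetical helpers, and `SAMPLE` is an abridged copy of the fields Nick posted earlier in the thread. Note that "pni" only separates Prescott from the earlier P4 cores; later processors report it as well.

```python
SAMPLE = """\
processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 3
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 4
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni monitor ds_cpl cid
"""


def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per processor stanza."""
    cpus, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current processor's stanza.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def has_sse3(cpu):
    """True if the stanza's flags include 'pni' (Linux's name for SSE3)."""
    return "pni" in cpu.get("flags", "").split()
```

On a Linux machine the same functions can be fed `open("/proc/cpuinfo").read()` instead of `SAMPLE`.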