From: Christophe R. <cs...@ca...> - 2004-03-02 16:06:06
Andreas Fuchs <as...@bo...> writes:

> Hi all,
>
> this is a test of the emergency autobuilder & benchmark runner. In the
> next few days, you can expect to see more of these in this place. If
> all is well, there will be a mail sent to the list once per day (if
> there were commits), starting this Thursday.

Excellent!

> I believe it should be possible and interesting to make a build farm
> with this set of scripts (especially to see performance
> improvements/degradation in non-x86 backends as well). All you would
> need is rsync, tla, ploticus (perhaps, I'm thinking about a version
> without ploticus) and a reasonably fast machine. If you're interested
> in running this benchmark, just ask (-:

I think in particular a fast PowerPC benchmarker (again with historical
data inasmuch as this is possible) would be very useful. My suspicion
is that there are things that are sufficiently dissimilar between the
PowerPC and x86 platforms that will make this interesting.

> The numbers below were generated by running the emarsden benchmarks
> three times. The format for the reference column is as follows:
> [mean(samples_ref) | standard error(samples_ref)]
> The others are relative to the reference:
> (mean(samples_version) / mean(samples_ref)) | standard error (samples_version)

Just for the record, what I would like to see quoted for the relative
times are

  mean(samples_version) / mean(samples_ref)

and

  standard error(samples_version) / mean(samples_ref)

Since I'm aware that not everyone has done as much statistical theory
as they should^W^WI have (and even if they have, they may have a
slightly different convention for presenting results), let me go into
this a little more. People not interested in the mathematical detail
can safely elide some of this.

Imagine taking k samples from a distribution X, with a view to
measuring the mean, \mu, of X. By taking the mean of the sample, we get
an estimate for the population mean. Label the samples x_1 ... x_k, and
compute a statistic

  \bar{x} = \frac{1}{k} \sum_i x_i.

This statistic is an unbiased estimator for the mean of the population:

  E_X(\bar{x}) = E(\frac{1}{k} \sum_i x_i)
               = \frac{1}{k} \sum_i E(x_i)
               = \frac{1}{k} \sum_i \mu
               = \mu.

However, not only do we want an estimate for the population mean (in
this specific case, "how much time does it take to run this
benchmark?"), but we also want to know how wide of the mark our
estimate could be. For this, we want to compute the variance of our
statistic (using the independence of the samples in the fourth step):

  Var_X(\bar{x}) = Var(\frac{1}{k} \sum_i x_i)
                 = E([\frac{1}{k} \sum_i x_i]^2) - E(\frac{1}{k} \sum_i x_i)^2
                 = \frac{1}{k^2} E([\sum_i x_i]^2) - \mu^2
                 = \frac{1}{k^2} (k E(x^2) + k(k-1)\mu^2) - \mu^2
                 = \frac{1}{k} (E(x^2) - \mu^2)
                 = \frac{1}{k} \sigma^2

where \sigma^2 is the variance of the population X. So, to get an
estimate of the error on a given estimate of the mean, we estimate the
standard deviation of the population and divide by the square root of
the number of samples.[*] This is what I mean, at least, when I talk
about the "standard error" or "standard error on the mean", and it's
what physicists quote when they say "foo was measured to be 20 +/- 3".
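To make this concrete, here's a quick sketch in Lisp of the quantities
I'd like to see quoted (the function names are mine, invented for
illustration; this is nothing to do with Andreas' actual scripts, and
it assumes at least two samples per benchmark):

  (defun mean (samples)
    (/ (reduce #'+ samples) (length samples)))

  (defun standard-error (samples)
    ;; Estimate the population standard deviation, dividing the sum of
    ;; squared deviations by (k-1) for an unbiased estimate of
    ;; sigma^2, then divide by sqrt(k) as derived above.
    (let* ((k (length samples))
           (m (mean samples)))
      (sqrt (/ (reduce #'+ samples :key (lambda (x) (expt (- x m) 2)))
               (1- k) k))))

  (defun relative-result (samples-version samples-ref)
    ;; Scale both the mean and its error by the reference mean, so
    ;; that the two numbers in a quoted "1.05|0.01" pair are on the
    ;; same scale.
    (let ((ref-mean (mean samples-ref)))
      (values (/ (mean samples-version) ref-mean)
              (/ (standard-error samples-version) ref-mean))))

With only three runs per benchmark the error bars are necessarily
crude, but they're enough to tell a real slowdown from noise.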
So, with all that said (and even taking account of the fact that the
calculation of the errors presented in Andreas' e-mail is bogus for
mean times not equalling 1), there are some oddities in the benchmark
results. I've snipped to leave in just the odd ones, but I'll reserve
further comment until more data arrive.

Cheers,

Christophe

[*] In case you're wondering, no, this bit isn't rigorous. It's good
enough for Physics, though :-)

> Benchmark                  Reference     0.8.8.3    0.8.8.1
> ------------------------------------------------------------
> LOAD-FASL                  [ 1.60|0.00]  0.95|0.00  1.01|0.00
> SUM-PERMUTATIONS           [ 8.02|0.01]  1.05|0.01  0.99|0.00
> FPRINT/UGLY                [ 3.27|0.00]  1.10|0.00  0.99|0.00
> CLOS/simple-instantiate    [ 0.76|0.00]  1.04|0.00  1.04|0.00
> CLOS/complex-methods       [ 2.21|0.01]  1.06|0.01  1.00|0.01

Incidentally, this presentation isn't terribly easy to read... the
graphs are a lot better, fortunately. For this, I think it would help
if (a) the Reference were something altogether different (say, CMUCL,
or a long-distant version of sbcl), and (b) the remaining columns were
arranged chronologically rightwards rather than leftwards (or, even
worse, in order of performing the benchmarks... :-)

Other than that, it's looking good. Thanks very much for your work on
this!

Cheers,

Christophe

-- 
http://www-jcsu.jesus.cam.ac.uk/~csr21/  +44 1223 510 299/+44 7729 383 757
(set-pprint-dispatch 'number (lambda (s o) (declare (special b)) (format s b)))
(defvar b "~&Just another Lisp hacker~%")
(pprint #36rJesusCollegeCambridge)