Hello,
I have found a bug in gnuplot kernel density calculation. In the
eval_kdensity function of interpol.c, there is a definition of "n =
num_points - 1" and then the data average is calculated as sum/n instead
of sum/num_points. The data average, standard deviation, and default
bandwidth are thus incorrectly calculated. The standard deviation can
even become undefined causing the plot to fail. This applies to the
4.6.0 and earlier versions. A patch for 4.6.0 is attached, as well as
the datasets used for testing.
I have tested this with two datasets (see below), one causing a wrong
plot, the other failing to plot.
The command used is:
plot "dataset" u 1:(1./100) smooth kdensity w l
Without my patch:
DATASET1:
num_points: 100
n: 99
min: 23.23000
max: 48.23000
avg: 35.88505 INCORRECT!
sigma: 3.95375 INCORRECT!
default_bandwidth: 1.66723 INCORRECT!
For dataset1, average, standard deviation, and default bandwidth are
incorrectly calculated but a (wrong) plot is obtained.
DATASET2:
num_points: 100
n: 99
min: 21.03000
max: 25.67000
avg: 23.33657 INCORRECT!
sigma: -nan
default_bandwidth: -nan
Warning: empty y range [0.01:0.01], adjusting to [0.0099:0.0101]
For dataset2 it is worse: as the incorrectly calculated standard
deviation is undefined (sqrt of a negative number), the default
bandwidth is also undefined and there is no visible plot obtained.
With my patch, the correct behavior is obtained:
DATASET1:
num_points: 100
n: 99
min: 23.23000
max: 48.23000
avg: 35.52620
sigma: 5.31266
default_bandwidth: 2.24027
DATASET2:
num_points: 100
n: 99
min: 21.03000
max: 25.67000
avg: 23.10320
sigma: 0.99933
default_bandwidth: 0.42140
Thomas Gaillard
Thomas Gaillard
2012-04-19
interpol.c patch for 4.6.0
Thomas Gaillard
2012-04-19
dataset1
Thomas Gaillard
2012-04-19
dataset2
Thomas Gaillard
2012-04-19
Ethan Merritt
2012-04-19
Thomas Gaillard
2012-06-01
Ethan Merritt
2012-08-23
I have applied your patch to both 4.6 and 4.7
The calculation of the mean was clearly wrong.
However, I'd like someone to confirm whether the kernel density function wants sigma or estimated standard deviation. Given that gnuplot is usually given experimental samples, the original division by (N-1) may be correct.
Philipp?
Ethan Merritt
2013-09-24