#1118 incorrect kernel density calculation

Status: closed-fixed
Labels: 2D plot (258)
Priority: 7
Updated: 2015-03-23
Created: 2012-04-19
Private: No

Hello,

I have found a bug in gnuplot's kernel density calculation. The
eval_kdensity function in interpol.c defines "n = num_points - 1" and
then computes the data average as sum/n instead of sum/num_points. As a
result, the data average, the standard deviation, and the default
bandwidth are all calculated incorrectly. The standard deviation can
even become undefined, which makes the plot fail. This affects version
4.6.0 and earlier. A patch for 4.6.0 is attached, along with the
datasets used for testing.

I have tested this with two datasets (see below), one causing a wrong
plot, the other failing to plot.

The command used is:

plot "dataset" u 1:(1./100) smooth kdensity w l

Without my patch:

DATASET1:
num_points: 100
n: 99
min: 23.23000
max: 48.23000
avg: 35.88505 INCORRECT!
sigma: 3.95375 INCORRECT!
default_bandwidth: 1.66723 INCORRECT!

For dataset1, average, standard deviation, and default bandwidth are
incorrectly calculated but a (wrong) plot is obtained.

DATASET2:
num_points: 100
n: 99
min: 21.03000
max: 25.67000
avg: 23.33657 INCORRECT!
sigma: -nan
default_bandwidth: -nan
Warning: empty y range [0.01:0.01], adjusting to [0.0099:0.0101]

For dataset2 it is worse: because the incorrectly calculated standard
deviation is undefined (sqrt of a negative number), the default
bandwidth is also undefined and no visible plot is obtained.
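
Why one dataset still plots while the other fails can be worked out from the formulas, assuming the unpatched code computes the variance as sum(x^2)/n - (sum(x)/n)^2 with n = N - 1. That form is an assumption consistent with the numbers reported here, not a quote from interpol.c. Writing \mu and \sigma for the correct mean and standard deviation:

```latex
\begin{aligned}
\mathrm{avg}_{\text{bug}} &= \frac{1}{N-1}\sum_i x_i = \frac{N}{N-1}\,\mu,\\
\sigma_{\text{bug}}^2 &= \frac{1}{N-1}\sum_i x_i^2 - \mathrm{avg}_{\text{bug}}^2
  = \frac{N}{N-1}\left(\sigma^2 + \mu^2\right) - \frac{N^2}{(N-1)^2}\,\mu^2
  = \frac{N}{N-1}\left(\sigma^2 - \frac{\mu^2}{N-1}\right).
\end{aligned}
```

The square root is undefined whenever \sigma^2 < \mu^2/(N-1). For dataset1, \sigma^2 is about 28.22 and \mu^2/99 is about 12.75, so a (wrong) sigma of sqrt((100/99)(28.22 - 12.75)), about 3.95, is produced. For dataset2, \sigma^2 is about 1.00 and \mu^2/99 is about 5.39, so the result is NaN.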

With my patch, the correct behavior is obtained:

DATASET1:
num_points: 100
n: 99
min: 23.23000
max: 48.23000
avg: 35.52620
sigma: 5.31266
default_bandwidth: 2.24027

DATASET2:
num_points: 100
n: 99
min: 21.03000
max: 25.67000
avg: 23.10320
sigma: 0.99933
default_bandwidth: 0.42140

Thomas Gaillard

Related

Patches: #656

Discussion

  • Thomas Gaillard - 2012-04-19

    interpol.c patch for 4.6.0

     
  • Thomas Gaillard - 2012-04-19

    dataset1

     
  • Thomas Gaillard - 2012-04-19

    dataset2

     
  • Thomas Gaillard - 2012-04-19
    • priority: 5 --> 7
     
  • Ethan Merritt - 2012-04-19
    • assigned_to: nobody --> janert
     
  • Thomas Gaillard - 2012-06-01
    • status: open --> open-fixed
     
  • Ethan Merritt - 2012-08-23

    I have applied your patch to both 4.6 and 4.7.

    The calculation of the mean was clearly wrong.

    However, I'd like someone to confirm whether the kernel density function wants sigma (division by N) or the estimated standard deviation (division by N-1). Given that gnuplot is usually given experimental samples, the original division by (N-1) may be correct.

    Philipp?

     
  • Ethan Merritt - 2013-09-24
    • status: open-fixed --> closed-fixed
    • Group: --> 5.0
     
