From: Philipp K. Janert <janert@ie...>  2008-05-23 23:15:36

I just submitted a patch (1970923) which draws a smooth histogram-like curve for a random collection of points, using a Gaussian kernel density estimation algorithm.

Demos are found here:
http://www.philippjanert.com/kdensity

The new method has the following advantages over the classic way of generating histograms using "smooth frequency":

- the resulting histogram is a smooth curve, making the effect of binning less severe
- it handles intermediate "bins" with no points in them gracefully. (smooth freq does so only if used "with boxes", but if you use "with lines" for example, the line will not drop to zero if an intermediate bin is empty)

The method is invoked like a weighted smoothing algorithm:

    plot "data" u 1:(1):(1) smooth kdensity

where the 2nd parameter is the weight of each point and the 3rd parameter is the bandwidth to be used.

This patch complements the "smooth cumulative" algorithm as another way to visualize the distribution of a collection of random points.

Comments and suggestions are welcome.

Best,

Ph.
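For readers who want to see the idea outside gnuplot, here is a minimal Python sketch of a weighted Gaussian kernel density estimate. It is not the code from the patch: the function name gaussian_kde and the sample values are made up for illustration, and dividing by the total weight (so the curve integrates to one) is an assumption, since the thread does not say how the patch normalises its output. The example places two clusters far apart to show that the curve falls to nearly zero in the empty region between them.

```python
import math


def gaussian_kde(points, weights, bandwidth, x):
    """Weighted Gaussian kernel density estimate at position x.

    Hypothetical helper, not taken from the gnuplot patch.
    """
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    total = 0.0
    for p, w in zip(points, weights):
        z = (x - p) / bandwidth
        # Each data point contributes one Gaussian bump scaled by its weight.
        total += w * math.exp(-0.5 * z * z) / norm
    # Assumption: normalise by the total weight so the curve integrates to 1.
    return total / sum(weights)


# Two clusters with an empty stretch between them (illustrative data).
points = [-0.1, 0.0, 0.2, 9.8, 10.0, 10.1]
weights = [1.0] * len(points)   # corresponds to the (1) weight column
bandwidth = 1.0                 # corresponds to the (1) bandwidth column

near_cluster = gaussian_kde(points, weights, bandwidth, 0.0)
in_gap = gaussian_kde(points, weights, bandwidth, 5.0)
print(near_cluster, in_gap)
```

The value in the gap is several orders of magnitude smaller than the value near a cluster, which is the behaviour described above for empty intermediate "bins".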
From: <plotter@pi...>  2008-05-24 07:10:59

On Sat, 24 May 2008 01:15:37 +0200, Philipp K. Janert <janert@...> wrote:
> I just submitted a patch (1970923) which
> draws a smooth histogram-like curve
> for a random collection of points, using
> a Gaussian kernel density estimation
> algorithm.
> Demos are found here:
> http://www.philippjanert.com/kdensity

very interesting.

My initial impression on looking at your top left example is that there is a phase shift of +half a box in x, most visible in the 0.01 and 0.05 plots.

Considering that x=0 is in the middle of the first box, it appears that the fits are responding in a way that aligns with the right of each box. It's a bit subjective due to the nature of the data, but this is my impression for the peaks at 0.2, 0.4 and 0.5.

Maybe you could test this effect with a rapid change in the data.

I also think there is an edge effect at the beginning and end of the data. This is a common problem when applying this sort of technique to image data: how to deal with edges when the kernel goes outside the data. There are several "solutions" which involve falsely extending the data, but applying a kernel to an incomplete sample range is effectively filling it with zeros and is equally false.

Likewise, a running mean cannot be meaningful up to the edges of the sample range, since there are not enough samples to take the mean over.

This also gives an artificial drop-off at the edges. It is rather marked near the origin in your lognormal example.

In image processing it's just a case of prettying up the edges, but in a scientific context this is clearly not appropriate.

I think the only rigorous way to deal with this is not to plot the part where the data is incomplete.

I hope the comments are useful.

best regards, Peter.
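The drop-off at the edges can be reproduced with a small Python sketch. This is an illustration under stated assumptions, not the patch's code: the helper kde is hypothetical, the estimate is normalised by the number of points, and the sample is an evenly spaced grid on [0, 1] so that the true density is 1 everywhere in that interval. In the interior the estimate comes out close to 1, while at the left edge about half of each kernel's mass would have to come from points that do not exist, so the estimate is close to 0.5.

```python
import math


def kde(points, bandwidth, x):
    """Unweighted Gaussian kernel density estimate at x (hypothetical helper)."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


# Evenly spaced sample on [0, 1]: the underlying density is 1 on this interval.
points = [i / 1000.0 for i in range(1001)]
bandwidth = 0.05

centre = kde(points, bandwidth, 0.5)  # many bandwidths away from either edge
edge = kde(points, bandwidth, 0.0)    # exactly at the lower edge of the data

print(centre, edge)
```

The factor of roughly one half at the edge is specific to this flat sample; for other data the size of the drop depends on how the points are distributed near the boundary.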
From: Philipp K. Janert <janert@ie...>  2008-05-25 23:41:33

On Saturday 24 May 2008 00:11, you wrote:
> On Sat, 24 May 2008 01:15:37 +0200, Philipp K. Janert <janert@...> wrote:
> > I just submitted a patch (1970923) which
> > draws a smooth histogram-like curve
> > for a random collection of points, using
> > a Gaussian kernel density estimation
> > algorithm.
> > Demos are found here:
> > http://www.philippjanert.com/kdensity
>
> very interesting.
>
> My initial impression on looking at your top left example is that there is
> a phase shift of +half a box in x most visible in the 0.01 and 0.05 plots.

The edge effect is actually in the histogram, not in the kernel density (yet another advantage of kdensities over histograms: the annoying bin-placement problem goes away).

The code does what all current gnuplot smoothing algos do: it stops at the min and max data point in the sample. I think this is reasonable.

Best,

Ph.
From: <plotter@pi...>  2008-05-26 00:52:05

On Mon, 26 May 2008 01:41:31 +0200, Philipp K. Janert <janert@...> wrote:
> On Saturday 24 May 2008 00:11, you wrote:
>> On Sat, 24 May 2008 01:15:37 +0200, Philipp K. Janert <janert@...> wrote:
>> > I just submitted a patch (1970923) which
>> > draws a smooth histogram-like curve
>> > for a random collection of points, using
>> > a Gaussian kernel density estimation
>> > algorithm.
>> > Demos are found here:
>> > http://www.philippjanert.com/kdensity
>>
>> very interesting.
>>
>> My initial impression on looking at your top left example is that there is
>> a phase shift of +half a box in x most visible in the 0.01 and 0.05 plots.
>
> The edge effect is actually in the histogram,
> not in the kernel density (yet another advantage
> of kdensities over histograms: the annoying
> bin-placement problem goes away).

hmm, never been much of a fan of bins and histograms, that's probably why. More for sociologists and economists.

> The code does what all current gnuplot
> smoothing algos do: they stop at the min
> and max data point in the sample. I think
> this is reasonable.

Well, I'm not sure that is comparable. IIRC all the "smoothing" algos (apart from unique) are splines, and these are calculated over 4 data points. In fact they would require just one point outside the data range at each end. I have not looked at how they are dealt with, but it is unlikely to be important for one point.

However, techniques using a kernel require half the kernel width outside each end of the data range. I would guess by looking at your examples that the missing data are initialised as zero. Is that correct?

Sorry to be a stickler for detail, it must be my rigorous physics training coming out. As people become less and less aware of what all these software tools are actually doing for them, it becomes more and more important that the tools do not introduce distortions.

Don't think I'm knocking your efforts, I'm pretty impressed overall.

best regards, Peter.

> Best,
>
> Ph.
From: <plotter@pi...>  2008-05-26 01:51:27

On Mon, 26 May 2008 01:41:31 +0200, Philipp K. Janert <janert@...> wrote:
> The code does what all current gnuplot
> smoothing algos do: they stop at the min
> and max data point in the sample. I think
> this is reasonable.
> Best,
> Ph.

What I would suggest is that it only produces a plot line over the range where the data is complete.

If the sample is large enough for this to be negligible, it won't be noticeable anyway. If it is significant in relation to the data sample, the plot line will stop at the point where it is no longer mathematically valid.

That would seem to be the correct thing to do. I see little justification for extending it beyond that range.

It must be a trivial change to make if you accept the principle.

best regards, Peter.
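One way to read this suggestion is sketched below in Python. It is only an illustration, not part of the patch or of gnuplot: the names complete_range and plot_grid are hypothetical, and because a Gaussian kernel has no finite width, treating three bandwidths as the effective half-width is an assumption made here (the cutoff parameter). The curve would then be evaluated only on the trimmed interval, and not at all when the bandwidth is too large for the sample.

```python
def complete_range(points, bandwidth, cutoff=3.0):
    """Interval on which the kernel lies (almost) entirely inside the data.

    Hypothetical helper. The cutoff of 3 bandwidths is an assumed stand-in
    for "half the kernel width", since a Gaussian never reaches zero.
    Returns None when no such interval exists.
    """
    margin = cutoff * bandwidth
    lo = min(points) + margin
    hi = max(points) - margin
    return (lo, hi) if lo < hi else None


def plot_grid(points, bandwidth, samples=100, cutoff=3.0):
    """x positions at which the density curve would be drawn."""
    rng = complete_range(points, bandwidth, cutoff)
    if rng is None:
        return []  # bandwidth too large for this sample: draw nothing
    lo, hi = rng
    step = (hi - lo) / (samples - 1)
    return [lo + i * step for i in range(samples)]


data = [0.0, 0.3, 0.5, 1.0]          # illustrative sample
print(complete_range(data, 0.05))    # trimmed by 0.15 at each end
print(complete_range(data, 0.5))     # margin exceeds the data range
print(len(plot_grid(data, 0.05)))
```

With a large sample and a small bandwidth the trimmed margin is a small fraction of the range, which matches the remark that the change would then be negligible.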