From: Christopher Fonnesbeck <statistics@me...>  2010-10-22 13:48:11

I notice that when the number of bins in a histogram is sparse, the spacing between the bins can be irregular. For example:

http://cl.ly/7e0ad7039873d5446365
http://cl.ly/c7cb20b567722928ac3c

Is there a way of normalizing this, and better, can the default behavior result in something more consistent (i.e. publication-quality)?

Thanks,
Chris
From: Ryan May <rmay31@gm...>  2010-10-22 14:14:01

On Fri, Oct 22, 2010 at 8:47 AM, Christopher Fonnesbeck <statistics@...> wrote:
> I notice that when the number of bins in a histogram is sparse, the spacing between the bins can be irregular. For example:
>
> http://cl.ly/7e0ad7039873d5446365
> http://cl.ly/c7cb20b567722928ac3c
>
> Is there a way of normalizing this, and better, can the default behavior result in something more consistent (i.e. publication-quality)?

That looks like some bizarre rounding/truncation or something like it. Can you post an example (can just use made up data) that reproduces this? I've not seen this before, so I sense it's due to the specific data types you're passing in.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
From: Christopher Fonnesbeck <statistics@me...>  2010-10-22 14:40:33

On Oct 22, 2010, at 9:13 AM, Ryan May wrote:
> That looks like some bizarre rounding/truncation or something like it.
> Can you post an example (can just use made up data) that reproduces
> this? I've not seen this before, so I sense it's due to the specific
> data types you're passing in.

Here is a very simple example. The data are just a list of integers:

http://dl.dropbox.com/u/233041/histexample.py

and it results in an odd choice of intervals.

(array([863, 775, 0, 271, 0, 67, 23, 0, 0, 1]),
 array([ 0. , 0.6, 1.2, 1.8, 2.4, 3. , 3.6, 4.2, 4.8, 5.4, 6. ]),
 <a list of 10 Patch objects>)

If there are only 7 possible values of the data, which are evenly-spaced, it should probably not go in and create more than 6 bins as the default behavior. I know I can specify bins by hand, but when automated it would be nice to have a more sensible default.

Thanks,
cf
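For readers without access to histexample.py, the output above can be reproduced with a small sketch. The data here are an assumed reconstruction: the per-value counts are read off the counts array in the quoted output (each integer 0..6 lands in exactly one of the 0.6-wide bins), not taken from the original script.

```python
import numpy as np

# Assumed reconstruction of the data: per-value counts inferred from the
# hist() output quoted in the message above, not from histexample.py itself.
value_counts = {0: 863, 1: 775, 2: 271, 3: 67, 4: 23, 5: 0, 6: 1}
data = np.repeat(list(value_counts.keys()), list(value_counts.values()))

# Default behavior: 10 equal-width bins spanning [min, max],
# so each bin is (6 - 0) / 10 = 0.6 wide.
counts, edges = np.histogram(data)

print(counts)
print(edges)
```

With a bin width of 0.6 and data that only take integer values, several bins (for example [1.2, 1.8)) cannot contain any value at all, which is what produces the irregular gaps in the plots.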
From: Ryan May <rmay31@gm...>  2010-10-22 16:12:45

On Fri, Oct 22, 2010 at 9:40 AM, Christopher Fonnesbeck <statistics@...> wrote:
> Here is a very simple example. The data are just a list of integers:
>
> http://dl.dropbox.com/u/233041/histexample.py
>
> and it results in an odd choice of intervals.
>
> (array([863, 775, 0, 271, 0, 67, 23, 0, 0, 1]),
>  array([ 0. , 0.6, 1.2, 1.8, 2.4, 3. , 3.6, 4.2, 4.8, 5.4, 6. ]),
>  <a list of 10 Patch objects>)
>
> If there are only 7 possible values of the data, which are evenly-spaced, it should probably not go in and create more than 6 bins as the default behavior. I know I can specify bins by hand, but when automated it would be nice to have a more sensible default.

It just defaults to creating 10 bins (which is identical to numpy.histogram, which is what does the work under the hood). If you know how many bins you want, you can just do:

    hist(x, bins=6)

This gives (for your example) the behavior you seem to want. I don't know of any way that would sensibly choose a number of bins automatically, but I'd consider a patch that proves me wrong. :)

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
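A small sketch of what bins=6 does for this kind of data, next to one possible alternative. The data are again an assumed reconstruction from the counts quoted in the thread, and the half-integer edges are an illustration, not something proposed in the message above.

```python
import numpy as np

# Assumed reconstruction of the integer data (counts for the values 0..6).
data = np.repeat(np.arange(7), [863, 775, 271, 67, 23, 0, 1])

# bins=6 gives edges 0, 1, ..., 6. Every bin is half-open except the last,
# which is closed on the right, so the values 5 and 6 share the final bin.
counts6, edges6 = np.histogram(data, bins=6)

# Alternative: put the edges on half-integers so each integer value sits
# in the middle of its own unit-width bin (7 bins for 7 possible values).
half_edges = np.arange(data.min() - 0.5, data.max() + 1.5, 1.0)
counts7, _ = np.histogram(data, bins=half_edges)

print(counts6, edges6)
print(counts7, half_edges)
```

The same edge array can be passed to matplotlib as hist(x, bins=half_edges), since hist accepts either a bin count or a sequence of edges.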
From: Maarten Sneep <maarten.sneep@kn...>  2010-10-22 17:06:43

On Fri, 2010-10-22 at 11:12 -0500, Ryan May wrote:
> On Fri, Oct 22, 2010 at 9:40 AM, Christopher Fonnesbeck wrote:
> > If there are only 7 possible values of the data, which are
> > evenly-spaced, it should probably not go in and create more than 6
> > bins as the default behavior. I know I can specify bins by hand, but
> > when automated it would be nice to have a more sensible default.
>
> It just defaults to creating 10 bins (which is identical to
> numpy.histogram, which is what does the work under the hood). If you
> know how many bins you want, you can just do:
>
>     hist(x, bins=6)
>
> This gives (for your example) the behavior you seem to want. I don't
> know of any way that would sensibly choose a number of bins
> automatically, but I'd consider a patch that proves me wrong. :)

I'm moving on from IDL. From that background I used the Coyote library quite a bit, and there I found:

    binsize = (3.5 * numpy.std(data)) / len(data)**(0.3333)

(from http://www.dfanning.com/programs/histoplot.pro known as Scott's Choice of bin size for histograms).

From the binsize and the range of the data, you then figure out an axis for the histogram.

Maarten

--
KNMI, De Bilt
T: 030 2206 747
E: Maarten.Sneep@...
Room B 2.42
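The two steps described above (compute a bin size, then derive the axis from the data range) might be sketched as follows. The function name scott_bin_edges is hypothetical, and rounding the number of bins up so that the edges span exactly [min, max] is an assumption of this sketch, not something taken from histoplot.pro.

```python
import numpy as np

def scott_bin_edges(data):
    """Hypothetical helper: bin edges from Scott's rule, h = 3.5 * std / n**(1/3)."""
    data = np.asarray(data, dtype=float)
    width = 3.5 * data.std() / len(data) ** (1.0 / 3.0)
    lo, hi = data.min(), data.max()
    if width <= 0 or hi == lo:
        # Constant data has no spread; fall back to a single bin.
        return np.array([lo, hi])
    # Round the bin count up, then spread the edges evenly over the range,
    # so the actual width is at most the width suggested by the rule.
    nbins = int(np.ceil((hi - lo) / width))
    return np.linspace(lo, hi, nbins + 1)

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
edges = scott_bin_edges(sample)
counts, _ = np.histogram(sample, bins=edges)
print(len(edges) - 1, "bins")
```

Recent NumPy releases also accept bins='scott' in numpy.histogram, which applies a closely related estimator without any hand-written code.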
From: Ryan May <rmay31@gm...>  2010-10-22 18:40:07

On Fri, Oct 22, 2010 at 11:31 AM, Maarten Sneep <maarten.sneep@...> wrote:
> I'm moving on from IDL. From that background I used the Coyote library
> quite a bit, and there I found:
>
>     binsize = (3.5 * numpy.std(data)) / len(data)**(0.3333)
>
> (from http://www.dfanning.com/programs/histoplot.pro known as Scott's
> Choice of bin size for histograms).

Thanks for that. This actually led me here:

http://en.wikipedia.org/wiki/Histogram

which gives a bunch of different ways to estimate the number of bins/binsize. It might be worth looking at one of these in general. However, ironically enough, these wouldn't actually give the original poster the desired results: the binsizes would lead to lots of bins, many of which would be empty due to the integer data. In fact, it seems that all of these methods are going to break down due to integer data. I guess you could take the ceiling of the calculated binsize... anyone have an opinion on whether calculating binsize/nbins would be a step forward over leaving the default (of 10) and letting the user calculate if they like?

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
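To make the breakdown concrete, here is a rough sketch that applies the Scott formula from earlier in the thread to integer data and then takes the ceiling of the bin size, as suggested above. The data are an assumed reconstruction of the original poster's example, and centering the bins on the integers is an extra assumption of this sketch.

```python
import numpy as np

# Assumed reconstruction of the integer data (counts for the values 0..6).
data = np.repeat(np.arange(7), [863, 775, 271, 67, 23, 0, 1])

# Scott's rule suggests a bin size well below 1 here, so most of the
# resulting bins could never contain an integer value.
width = 3.5 * data.std() / len(data) ** (1.0 / 3.0)

# Taking the ceiling forces the bin size to a whole number of integers.
int_width = max(1, int(np.ceil(width)))

# Start half a unit below the minimum so integers fall inside the bins
# rather than on their boundaries, and extend far enough to cover the maximum.
edges = np.arange(data.min() - 0.5, data.max() + 0.5 + int_width, int_width)
counts, _ = np.histogram(data, bins=edges)

print(round(width, 3), int_width)
print(counts)
```

For this data the ceiling gives a bin size of 1, which recovers one bin per integer value; for integer data with a much larger spread it would group several integers per bin.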
From: Maarten Sneep <maarten.sneep@kn...>  2010-10-25 14:32:49

On Fri, 2010-10-22 at 13:39 -0500, Ryan May wrote:
> Thanks for that. This actually led me here:
> http://en.wikipedia.org/wiki/Histogram which gives a bunch of
> different ways to estimate the number of bins/binsize. It might be
> worth looking at one of these in general. However, ironically enough,
> these wouldn't actually give the original poster the desired
> results: the binsizes would lead to lots of bins, many of which would
> be empty due to the integer data. In fact, it seems that all of these
> methods are going to break down due to integer data. I guess you could
> take the ceiling of the calculated binsize... anyone have an opinion on
> whether calculating binsize/nbins would be a step forward over leaving
> the default (of 10) and letting the user calculate if they like?

Integer histograms are a different beast altogether. It is not very hard to define a natural bin width for integer histograms: 1. The only sensible alternatives are integer multiples of that.

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.int32(np.rint(200*np.random.randn(10000)))
    axis = np.arange(data.min(), data.max()+1)
    hist = np.zeros((data.max()-data.min()+1,), dtype=np.int32)

    # unfortunately the shortcut hist[data-data.min()] += 1 does not work,
    # the list of indices in data is simplified before looping implicitly.
    # Explicit loop:
    for item in data:
        hist[item-data.min()] += 1

    plt.plot(axis, hist)
    plt.show()

This histogram can easily be adapted to any sensible bin size, as this is the finest possible increment. With floats you have to do things the hard way because there is no such thing as a natural bin size.

And yes, the np.histogram() function is much faster.

    hist2 = np.histogram(data, bins=data.max()-data.min())
    plt.plot(hist2[1][0:-1]+0.5, hist2[0])
    plt.show()

I don't like putting the data on the bin boundaries, as it is very clear what the bins can be in this case.

Yes, this is not so much a hard suggestion, as it is a line of thought.

Treating integer data for histograms differently from pseudo-continuous data is the natural way in my view. Scaling (grouping bins) could be done to ensure that the most populated bin contains 4*ndata/nbins points (yes, this fails for uniformly distributed data).

Maarten

--
KNMI, De Bilt
T: 030 2206 747
E: Maarten.Sneep@...
Room B 2.42
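As a side note on the explicit loop in this message, the same unit-width integer histogram can be built without a Python-level loop. The sketch below is not from the thread: it uses np.bincount and np.add.at as two possible vectorized alternatives, and it also shows why the in-place shortcut undercounts. The variable names and the fixed random seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
data = np.rint(200 * rng.standard_normal(10000)).astype(np.int64)

offset = data - data.min()                   # shift so the smallest value maps to index 0
axis = np.arange(data.min(), data.max() + 1)

# Reference result: the explicit loop from the message above.
hist_loop = np.zeros(axis.size, dtype=np.int64)
for item in offset:
    hist_loop[item] += 1

# Alternative 1: bincount counts occurrences of each non-negative integer.
hist_bincount = np.bincount(offset, minlength=axis.size)

# Alternative 2: unbuffered in-place add, which does honor repeated indices.
hist_add_at = np.zeros(axis.size, dtype=np.int64)
np.add.at(hist_add_at, offset, 1)

# The shortcut: with fancy indexing, a repeated index is only incremented
# once, so every occupied bin ends up holding 1 instead of its true count.
hist_shortcut = np.zeros(axis.size, dtype=np.int64)
hist_shortcut[offset] += 1

print(hist_loop.sum(), hist_bincount.sum(), hist_shortcut.sum())
```

Grouping into wider integer bins can then be done by summing adjacent entries of the unit-width histogram, since 1 is the finest possible increment.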