2011/1/1 OKB (not okblacke) <brenbarn@...>:
> I noticed that the boxplot function incorrectly calculates the
> location of the median line in each box. As a simple example, plotting
> the dataset [1, 2, 3, 4] incorrectly plots the median line at 3.
It seems to work fine in matplotlib 1.0.0:
user@...:~$ python
Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib as mpl
>>> mpl.__version__
'1.0.0'
>>> import matplotlib.pyplot as plt
>>> import matplotlib.mlab as mlab
>>> plt.ion()
>>> plt.boxplot([1, 2, 3, 4])
{'medians': [<matplotlib.lines.Line2D object at 0x3ad6250>], 'fliers':
[<matplotlib.lines.Line2D object at 0x3ad6610>,
<matplotlib.lines.Line2D object at 0x3ad69d0>], 'whiskers':
[<matplotlib.lines.Line2D object at 0x3acff50>,
<matplotlib.lines.Line2D object at 0x3ad4310>], 'boxes':
[<matplotlib.lines.Line2D object at 0x3ad4e50>], 'caps':
[<matplotlib.lines.Line2D object at 0x3ad46d0>,
<matplotlib.lines.Line2D object at 0x3ad4a90>]}
>>> plt.grid()
>>> plt.boxplot([1, 2, 3, 4])
{'medians': [<matplotlib.lines.Line2D object at 0x3dfbad0>], 'fliers':
[<matplotlib.lines.Line2D object at 0x3dfbe90>,
<matplotlib.lines.Line2D object at 0x3dff290>], 'whiskers':
[<matplotlib.lines.Line2D object at 0x3df8810>,
<matplotlib.lines.Line2D object at 0x3df8b90>], 'boxes':
[<matplotlib.lines.Line2D object at 0x3dfb710>], 'caps':
[<matplotlib.lines.Line2D object at 0x3df8f50>,
<matplotlib.lines.Line2D object at 0x3dfb350>]}
>>> plt.grid()
>>> # See attached image.
...
>>> mlab.prctile([1, 2, 3, 4])
array([ 1. , 1.75, 2.5 , 3.25, 4. ])
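As a cross-check, here is a small sketch that computes the same statistics with NumPy directly. It assumes a NumPy version that provides np.percentile with its default linear interpolation; it is not the code boxplot itself runs. For this data set it gives the same median and quartiles as the mlab.prctile call above.

```python
import numpy as np

data = [1, 2, 3, 4]

# Median of an even-length data set: the mean of the two middle values.
median = np.median(data)

# Quartiles using NumPy's default (linear interpolation) method.
q1, q2, q3 = np.percentile(data, [25, 50, 75])

print(median, q1, q2, q3)
```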
Goyo
>
> It also seems that the quartile calculations for the box are a
> little peculiar. I have seen some discussion in old mailing list
> postings about mlab.prctile and its ways of calculating percentiles,
> which are different than those of some other software.
>
> I'm aware that there is legitimate disagreement about the "best"
> way to calculate the quartiles. However, it seems to me that mlab's way
> is still not any of these possibly-correct ways, because it uses int()
> or nparray.astype(int) to coerce the percentile result to an integer
> index. This TRUNCATES the floating-point result. No accepted
> quantile-calculating method that I'm aware of does this; they all ROUND instead
> of truncating (if they want to coerce to an integer index at all, in
> order to produce a quantile value that is an element of the data set),
> or in some cases they round uniformly up for the lower quartile and
> down for the upper. You can see a summary of different methods at
> http://www.amstat.org/publications/jse/v14n3/langford.html ; the method
> used by mlab does not appear to agree with any of these.
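To make the truncate-versus-round distinction concrete, here is a minimal sketch. The helper quantile_by_index and the position formula p/100*(n-1) are assumptions chosen for illustration only; this is not mlab.prctile's actual code, nor any one of the methods in the linked summary.

```python
import numpy as np

def quantile_by_index(data, p, coerce):
    """Return the element of the sorted data at the coerced position.

    Hypothetical helper: the fractional position p/100*(n-1) is an
    assumption for illustration, not mlab.prctile's formula.
    """
    x = np.sort(np.asarray(data, dtype=float))
    pos = p / 100.0 * (len(x) - 1)
    return x[coerce(pos)]

data = [1, 2, 3, 4, 5, 6, 7, 8]

# Lower quartile: the fractional position is 0.25 * 7 = 1.75.
truncated = quantile_by_index(data, 25, int)  # int() drops the fraction
rounded = quantile_by_index(data, 25, lambda pos: int(round(pos)))

print(truncated, rounded)
```

With the position at 1.75, truncating selects index 1 while rounding selects index 2, so the two approaches pick different elements of the data set.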
>
> I would suggest that mlab.prctile be fixed to conform to some one
> or other of these methods, rather than adding to the proliferation of
> approaches to quantile calculation. Is there any motivation for always
> truncating to integer (other than "it's quicker to type" :)?
>
> Also, regardless of these quartile issues, there is, as far as I'm
> aware, no one who denies that the median of a (sorted) data set with an
> even number of values is the mean of the middle two values. Since numpy
> is already a dependency for matplotlib, boxplot shouldn't use
> mlab.prctile at all to decide where to plot the median line; just use
> numpy.median.
>
> Thanks,
> 
> OKB (not okblacke)
> Brendan Barnwell
> "Do not follow where the path may lead. Go, instead, where there is
> no path, and leave a trail."
> author unknown
