From: Yaroslav H. <sf...@on...> - 2014-02-15 13:00:53
|
Dear Matplotlib gurus,

Following the code that demonstrates the recent(ish) fix for whiskers in boxplots (https://github.com/matplotlib/matplotlib/pull/1855), I have compared it against R's boxplot. The descriptions seem to correspond, and all the percentiles are the same in numpy and R (3.0.1), but R's boxplot seems to have an extended IQR box and still has an upper whisker (it corresponds to 9000, which is not within 75% + 1.5*IQR) when it shouldn't:

http://nbviewer.ipython.org/url/www.onerussian.com/tmp/boxplot-Python-vs-R.ipynb

Is R's plot incorrect, or am I missing something (e.g. a documented feature of R's boxplot) that warrants such a difference?

Thanks in advance
--
Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Senior Research Associate, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834 Fax: +1 (603) 646-1419
WWW: http://www.linkedin.com/in/yarik |
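For readers following along, here is a minimal sketch of the basic Tukey rule the question refers to: whiskers end at the most extreme observations that lie within 1.5*IQR of the box. The function name `tukey_whiskers` and the sample data are hypothetical (the notebook's actual data are not reproduced here), and the quartiles use numpy's default linear interpolation, which may differ from other conventions.

```python
import numpy as np


def tukey_whiskers(data, whis=1.5):
    """Quartiles, whisker ends, and fliers under the basic Tukey rule."""
    x = np.asarray(data, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo_fence = q1 - whis * iqr
    hi_fence = q3 + whis * iqr

    # Whiskers stop at the most extreme observations still inside the
    # fences; they never end inside the box itself.
    inside = x[(x >= lo_fence) & (x <= hi_fence)]
    whislo = min(inside.min(), q1) if inside.size else q1
    whishi = max(inside.max(), q3) if inside.size else q3

    # Everything beyond the fences is drawn as an individual point.
    fliers = x[(x < lo_fence) | (x > hi_fence)]
    return {"q1": q1, "med": med, "q3": q3,
            "whislo": whislo, "whishi": whishi, "fliers": fliers}


# Hypothetical data with one extreme value, loosely echoing the 9000 above
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9000]
stats = tukey_whiskers(sample)
print(stats)
```

With this sample the upper fence is 7.75 + 1.5*4.5 = 14.5, so 9000 is reported as a flier and the upper whisker ends at 9 rather than reaching the extreme value.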
From: Paul H. <pmh...@gm...> - 2014-02-15 18:19:32
|
Hey Yaroslav,

As the author of the fix and the recent overhaul to boxplots, I can say with certainty that R is wrong! ;-)

More seriously, the main thing that I take away from Tukey's paper about boxplots is that there are many valid ways to draw them. I personally set up the new boxplot functionality to take the most basic boxplot definition very literally. My guess is that R is fudging those rules a bit for the purpose of completeness, or aesthetics, or ...(?)

Perhaps one can look at the purpose of boxplots in two different ways:
1) Matplotlib: show some of the data and some basic stats
2) R (I'm guessing): show how the data are /probably/ distributed.

Obviously, I prefer #1. But I'm not going to say that #2 is wrong just yet. |
From: Yaroslav H. <sf...@on...> - 2014-02-15 22:21:07
|
Hi Paul,

On Sat, 15 Feb 2014, Paul Hobson wrote:
> As the author of the fix and the recent overhaul to boxplots

Thanks for that!

> I can say with certainty that R is wrong! ;-)

phew -- thanks ;)

> More seriously, the main thing that I take away from Tukey's paper about
> boxplots, is that there are many valid ways to draw them. I personally set
> up the new boxplot functionality to take the most basic boxplot definition
> very literally. My guess is that R is fudging those rules a bit for the
> purpose of completeness, or aesthetics, or ...(?)

Well -- I was trying to figure out the reason for the divergence from R's boxplot help, but so far the help seems to match the description/definition of a boxplot as used in matplotlib. I guess the next step is to look "inside" (running apt-get source r-base now ;-) )

> Perhaps one can look at the purpose of boxplots in two different fashions:
> 1) Matplotlib: show some of the data and some basic stats
> 2) R (I'm guessing): show how the data are /probably/ distributed.
> Obviously, I prefer #1. But I'm not going to say that #2 is wrong just
> yet.

Would you maybe be interested in adopting (or just implementing independently) an option to e.g. plot the data points? I once shared this one:
http://nbviewer.ipython.org/url/www.onerussian.com/tmp/run_plots.ipynb
and the actual code:
https://gist.github.com/yarikoptic/9023331

I just never got around to formalizing it into an mpl pull request :-/
--
Yaroslav O. Halchenko, Ph.D. |
From: Paul H. <pmh...@gm...> - 2014-02-15 23:34:08
|
Yaroslav,

Those figures look great. Seaborn has some similar functionality (scroll down a bit):
http://nbviewer.ipython.org/github/mwaskom/seaborn/blob/master/examples/plotting_distributions.ipynb#Comparing-distributions:-boxplot-and-violinplot

The main point of the most recent overhaul of boxplots was to allow users to do just what you describe. The methods plt.boxplot and ax.boxplot now do very little on their own. Input data are passed to matplotlib.cbook.boxplot_stats, that function returns a list of dictionaries of statistics, and then ax.bxp actually does the drawing. All of this is to say that you can write your own function to modify boxplot_stats' output, or independently generate the list of dictionaries expected by ax.bxp.

The keys of those dictionaries can include:
- label -> tick label for the boxplot
- mean -> mean value (can plot as a line or point)
- median -> 50th percentile
- q1 -> first quartile (25th pctl)
- q3 -> third quartile (75th pctl)
- cilo -> lower notch around the median
- cihi -> upper notch around the median
- whislo -> end of the lower whisker
- whishi -> end of the upper whisker
- fliers -> outliers

Basically, you can set the appropriate values to whatever you want in order to draw boxplots however you wish (like open/close diagrams for pandas).

Also, the `whis` kwarg accepted by boxplot and cbook.boxplot_stats can be a float (1.5 by default), a list of integer percentiles (like 5, 95), or one of the strings 'range', 'limits', or 'min/max', all of which will extend the whiskers over all of the data.

Since you're running off of master, you should have access to this new functionality. Here's a link to the PR that overhauled ax.boxplot and created ax.bxp:
https://github.com/matplotlib/matplotlib/pull/2643

Looking at it now -- it looks like cbook.boxplot_stats' docstring got cut off. I'll pull together a PR to fix that soon.

Feel free to hit me up with any other questions!
-paul |
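The two-step workflow described in this message (compute statistics, then draw them) can be sketched roughly as follows. This assumes a matplotlib release that ships `cbook.boxplot_stats` and `Axes.bxp`; the data, the labels, and the choice of the `Agg` backend are illustrative assumptions. Note that in released matplotlib versions the median key is spelled `med` and the upper notch `cihi`, and, as far as I know, newer releases no longer accept the string forms of `whis`, so the sketch passes a percentile pair instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cbook

# Hypothetical skewed data for two groups
rng = np.random.RandomState(0)
data = [rng.lognormal(mean=0.0, sigma=1.0, size=50) for _ in range(2)]

# Step 1: compute the statistics (one dict per dataset).
stats = cbook.boxplot_stats(data, whis=1.5)
for entry, name in zip(stats, ["group A", "group B"]):
    entry["label"] = name  # tick label used by bxp

# Same first dataset, but with whiskers bounded by the 5th/95th percentiles.
custom = dict(cbook.boxplot_stats(data[0], whis=[5, 95])[0])
custom["label"] = "A, 5-95 pctl"

# Step 2: hand the (possibly modified) dicts to bxp, which does the drawing.
fig, ax = plt.subplots()
artists = ax.bxp(stats + [custom], showmeans=True)
fig.canvas.draw()
```

Because `bxp` only sees the dictionaries, any entry (for example `whislo`/`whishi`) can be overwritten before drawing to follow a different whisker convention.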
From: Yaroslav H. <sf...@on...> - 2014-02-16 03:45:37
|
On Sat, 15 Feb 2014, Paul Hobson wrote:
> Those figures look great. Seaborn has some similar functionality (scroll
> down a bit):
> [1]http://nbviewer.ipython.org/github/mwaskom/seaborn/blob/master/examples/plotting_distributions.ipynb#Comparing-distributions:-boxplot-and-violinplot

Right -- seaborn looks really nice and I have yet to take advantage of it.

BUT that is why we are talking here, on the matplotlib list: seaborn (and a few others), while aiming to provide high-level convenience specific to e.g. using pandas as the core data structures, add improvements which could easily go into stock matplotlib and thus benefit all of the users. That is why I thought that improving boxplot itself could be of more generic benefit, while allowing all the dependent projects to take advantage of it without requiring unnecessary fragmentation (e.g. "use seaborn for paired plots", which could easily go straight into stock boxplot operating on arrays).

Even violin plots could probably be done in matplotlib with some basic density estimator (with a parameter for a custom one) as an option within the boxplot function itself.

> The main point of the most recent overhaul of boxplots was to allow users
> to just what you describe. The methods plt.boxplot and ax.boxplot now do
> very little on their own. Input data are passed to
> matplotlib.cbook.boxplot_stats, that function returns a list of
> dictionaries of statistics, and then ax.bxp actually does the drawing.
> [...]
> Since you're running off of master, you should access to this new
> functionality.

;-) Usually I run off the releases, and even more often from the releases in Debian stable. But yes -- I have master, and this new functionality looks neat -- thanks again. But those few enhancements, such as

- plotting the actual data points with jitter
- plotting pairing lines across boxplots

seem not to be there, and I would consider them worthwhile enhancements.

> Feel free to hit me up with any other questions!

Sorry that I hit you with something that is not really a question above ;-)
--
Yaroslav O. Halchenko, Ph.D. |
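The "pairing lines" enhancement requested here can already be approximated with stock calls by overlaying one line per paired observation on a two-box plot. This is only a sketch under assumptions: the before/after data are made up, and the styling values are arbitrary; it is not the code from the gist linked earlier in the thread.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical paired measurements: the same 20 subjects, measured twice
rng = np.random.RandomState(0)
before = rng.normal(10.0, 2.0, size=20)
after = before + rng.normal(1.0, 1.0, size=20)

positions = [1, 2]
fig, ax = plt.subplots()
ax.boxplot([before, after], positions=positions)

# One thin grey line per subject: each column of `paired` is drawn as its
# own line connecting that subject's two values.
paired = np.vstack([before, after])  # shape (2, n_subjects)
lines = ax.plot(positions, paired, color="0.6", linewidth=0.8, zorder=0)

ax.set_xticks(positions)
ax.set_xticklabels(["before", "after"])
fig.canvas.draw()
```

Keeping the overlay outside `boxplot` means the statistics and the raw pairs stay independent, at the cost of the caller having to match the box positions by hand.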
From: Thomas A C. <tca...@uc...> - 2014-02-16 04:21:46
|
As a side note, adding jitter has been discussed before (https://github.com/matplotlib/matplotlib/issues/2750) in a slightly different context and the consensus was to _not_ add it to mpl (as it is a non-deterministic data transformation).

Tom
--
Thomas A Caswell
PhD Candidate University of Chicago
Nagel and Gardel labs
tca...@uc...
jfi.uchicago.edu/~tcaswell
o: 773.702.7204 |
From: Yaroslav H. <sf...@on...> - 2014-02-17 05:40:05
|
On Sat, 15 Feb 2014, Thomas A Caswell wrote:
> As a side note, adding jitter has been discussed before
> (https://github.com/matplotlib/matplotlib/issues/2750) in a slightly
> different context and the consensus was to _not_ add it to mpl (as it
> is a non-deterministic data transformation).

Interesting discussion -- thanks for pointing it out, Tom.

Well -- for a scatter plot it does make sense to demand that jittering be done "outside". For a boxplot -- nope. The x-axis (in standard vertical boxplots) doesn't represent an informative dimension anyway, besides "grouping", and jitter imho would be for visualization purposes only. Also, any non-deterministic jitter can be made deterministic and reproducible by seeding. Since, once again, the randomization here would be added only for visualization purposes, it could e.g. always be produced by an RNG state seeded with 0 ;-)

--
Yaroslav O. Halchenko, Ph.D. |
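The seeding argument can be illustrated with a short sketch: a dedicated, fixed-seed random state produces the same horizontal offsets on every run, so the overlay is reproducible. The helper `jittered_x`, its `width` default, and the example data are hypothetical and not part of matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def jittered_x(n, center, width=0.12, seed=0):
    """Return n x-positions scattered uniformly around center, reproducibly."""
    rng = np.random.RandomState(seed)  # fixed seed -> same offsets every run
    return center + rng.uniform(-width, width, size=n)


# Hypothetical data for two groups
data_rng = np.random.RandomState(42)
groups = [data_rng.normal(loc, 1.0, size=30) for loc in (0.0, 1.5)]

fig, ax = plt.subplots()
# Fliers are hidden because every raw point is drawn by the overlay below.
ax.boxplot(groups, showfliers=False)
for pos, values in enumerate(groups, start=1):
    # Only x is perturbed; the y values (the data) are plotted unchanged.
    ax.plot(jittered_x(len(values), pos), values, "o", alpha=0.5, markersize=4)
fig.canvas.draw()
```

Using a private `RandomState` rather than the global numpy seed keeps the jitter deterministic without affecting any other random numbers the caller draws.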
From: Matt S. <ma...@pl...> - 2014-02-19 09:56:37
|
Hey all,

I thought I'd throw out that a tool I'm working on, Plotly (http://plot.ly), also does box plots with the option to show jittered points. Instead of passing in stats you pass in an array of values. Here is a notebook with the box plots with jitter: nbviewer.ipython.org/gist/fperez/8930306. You can also view the mean of the array (the dashed line), +/- 1.5 standard deviations around the median, and the outliers of the set (the hollow points): https://plot.ly/~ChrisPP/49.

More generally, we're hoping to soon let folks convert matplotlib scripts into a Plotly graph (GitHub issue: https://github.com/plotly/python-api/issues/3). We'd love your advice and thoughts.

Thanks a bunch,
M |