From: Allin Cottrell <cottrell@wf...>  2011-05-03 01:30:06
wage.dat

I'm wondering if there might be an off-by-one bug in gnuplot's
built-in boxplot functionality.

I'm attaching a small data file which I've tried plotting using
the following commands (simplified from boxplot.dem):

set style data boxplot
set pointsize 0.5
set border 2
set xtics ("wage" 1) scale 0.0
set xtics nomirror
set ytics nomirror
plot 'wage.dat' using (1):1 notitle

I calculate quartiles 1 and 3 as 1345 and 2140, respectively, and
this seems to agree with what gnuplot shows for the central box. I
therefore get 795 for the interquartile range, and multiplying
this by the default range multiplier of 1.5 I get 1192.5. Adding
this to Q3 gives 3332.5 for the upper whisker limit. The greatest
y-value in the data less than or equal to 3332.5 is 3307, so I'd
expect the upper whisker to extend that far, but it extends only
to about 2600, which could correspond to the next largest y-value,
namely 2613.

That is, I think that according to the docs one should see a
whisker extending to 3307 and three high outliers, but in fact I
see the whisker going to about 2600 and four high outliers. I can
get the result I'd expect to see with the default "range" value of
1.5, if I do

set style boxplot range 1.6

Allin Cottrell
Department of Economics
Wake Forest University

From: sfeam (Ethan Merritt) <eamerritt@gm...>  2011-05-03 03:10:29

On Monday, 02 May 2011, Allin Cottrell wrote:
> I'm wondering if there might be an off-by-one bug in gnuplot's
> built-in boxplot functionality.
>
> I'm attaching a small data file which I've tried plotting using
> the following commands (simplified from boxplot.dem):
>
> set style data boxplot
> set pointsize 0.5
> set border 2
> set xtics ("wage" 1) scale 0.0
> set xtics nomirror
> set ytics nomirror
> plot 'wage.dat' using (1):1 notitle
>
> I calculate quartiles 1 and 3 as 1345 and 2140, respectively, and
> this seems to agree with what gnuplot shows for the central box. I
> therefore get 795 for the interquartile range, and multiplying
> this by the default range multiplier of 1.5 I get 1192.5. Adding
> this to Q3 gives 3332.5 for the upper whisker limit. The greatest
> y-value in the data less than or equal to 3332.5 is 3307, so I'd
> expect the upper whisker to extend that far, but it extends only
> to about 2600, which could correspond to the next largest y-value,
> namely 2613.
>
> That is, I think that according to the docs one should see a
> whisker extending to 3307 and three high outliers, but in fact I
> see the whisker going to about 2600 and four high outliers. I can
> get the result I'd expect to see with the default "range" value of
> 1.5, if I do
>
> set style boxplot range 1.6

1st quartile == smallest index that encompasses 1/4 of the data points
             = data[ceil(49/4)]
             = 1433

3rd quartile == smallest index that encompasses 3/4 of the data points
             = data[ceil(49*3/4)]
             = 2115

1.5 * interquartile difference = (2115-1433) * 1.5 = 716.1

"The most distant point whose value lies within" 716.1 of 2115 is
2613, which is where the whisker ends.

There are surprisingly many differing definitions of "quartile".
I can sympathize if you prefer a different one than gnuplot uses,
but I think gnuplot is at least consistent with its own definition.

If people care enough about the definition of "quartile", I suppose
we could offer configuration options that select from a set of
possible definitions. R does this, as I recall.

Ethan

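
The index-based definition Ethan describes can be sketched in Python next to one averaging-style alternative, to show how the two can disagree on the same data. This is only an illustration: the sample values and both function names are made up here, the code is not taken from gnuplot, and the second function is just one of several common conventions, not necessarily the one Allin used.

```python
import math
from statistics import median


def quartiles_ceil_index(values):
    """Q1/Q3 as the value at the smallest 1-based rank covering 1/4 (3/4) of the points."""
    data = sorted(values)
    n = len(data)
    q1 = data[math.ceil(n / 4) - 1]      # -1 turns the 1-based rank into a Python index
    q3 = data[math.ceil(3 * n / 4) - 1]
    return q1, q3


def quartiles_median_of_halves(values):
    """Q1/Q3 as medians of the lower and upper halves (middle point dropped when n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[len(data) - half:])


# Hypothetical data for illustration only; this is NOT wage.dat.
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90]

print(quartiles_ceil_index(sample))
print(quartiles_median_of_halves(sample))

# Ranks the index-based rule picks for a 49-point data set.
print(math.ceil(49 / 4), math.ceil(49 * 3 / 4))
```

With an even-sized half, the second function averages two neighbouring points, so it can return a value that does not occur in the data, while the index-based rule always returns an actual data point.
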
From: Allin Cottrell <cottrell@wf...>  2011-05-03 03:55:15

On Mon, 2 May 2011, sfeam (Ethan Merritt) wrote:

> On Monday, 02 May 2011, Allin Cottrell wrote:
> > I'm wondering if there might be an off-by-one bug in gnuplot's
> > built-in boxplot functionality.
> >
> > I'm attaching a small data file which I've tried plotting using
> > the following commands (simplified from boxplot.dem):
> >
> > set style data boxplot
> > set pointsize 0.5
> > set border 2
> > set xtics ("wage" 1) scale 0.0
> > set xtics nomirror
> > set ytics nomirror
> > plot 'wage.dat' using (1):1 notitle
> >
> > I calculate quartiles 1 and 3 as 1345 and 2140, respectively, and
> > this seems to agree with what gnuplot shows for the central box. I
> > therefore get 795 for the interquartile range, and multiplying
> > this by the default range multiplier of 1.5 I get 1192.5. Adding
> > this to Q3 gives 3332.5 for the upper whisker limit. The greatest
> > y-value in the data less than or equal to 3332.5 is 3307, so I'd
> > expect the upper whisker to extend that far, but it extends only
> > to about 2600, which could correspond to the next largest y-value,
> > namely 2613.
> >
> > That is, I think that according to the docs one should see a
> > whisker extending to 3307 and three high outliers, but in fact I
> > see the whisker going to about 2600 and four high outliers. I can
> > get the result I'd expect to see with the default "range" value of
> > 1.5, if I do
> >
> > set style boxplot range 1.6
>
> 1st quartile == smallest index that encompasses 1/4 of the data points
>              = data[ceil(49/4)]
>              = 1433

Maybe I'm being dense but I don't see how you're getting that:
ceil(49/4) is 13, and data points 8 to 15 all have value 1345.
Also the plot from gnuplot 4.5 seems (by mouseover) to have Q1 at
1345.

> 3rd quartile == smallest index that encompasses 3/4 of the data points
>              = data[ceil(49*3/4)]
>              = 2115

OK, I can see that: the 2140 value that I gave for Q3 is an
average of two data points, which implies a different concept of
quartile -- and the rest follows.

> There are surprisingly many differing definitions of "quartile".
> I can sympathize if you prefer a different one than gnuplot uses,
> but I think gnuplot is at least consistent with its own definition.

OK, fair enough.

Allin Cottrell
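
The whisker arithmetic in this thread can be checked with a short Python sketch. It assumes the upper limit is Q3 plus the range factor times the interquartile difference, as both posters describe. It uses Q1 = 1345 (the value Allin reads off the plot) with Q3 = 2115 (the value Ethan gives), and compares that with Allin's own pair of 1345 and 2140. Only the two upper data values named in the thread, 2613 and 3307, are used, since the rest of wage.dat is not reproduced here. The function names are made up for this example.

```python
def upper_fence(q1, q3, range_factor=1.5):
    """Upper whisker limit: Q3 plus range_factor times the interquartile difference."""
    return q3 + range_factor * (q3 - q1)


def whisker_end(upper_values, fence):
    """Largest of the given data values that does not exceed the fence."""
    return max(v for v in upper_values if v <= fence)


# The only upper-tail values quoted in the thread; the rest of wage.dat is not shown.
upper_values = [2613, 3307]

# (label, Q1, Q3, range factor); the quartile pairs are taken from the messages above.
cases = [
    ("Allin's quartiles, range 1.5", 1345, 2140, 1.5),
    ("index-based quartiles, range 1.5", 1345, 2115, 1.5),
    ("index-based quartiles, range 1.6", 1345, 2115, 1.6),
]

for label, q1, q3, factor in cases:
    fence = upper_fence(q1, q3, factor)
    print(f"{label}: fence = {fence:.1f}, whisker ends at {whisker_end(upper_values, fence)}")
```

Under these assumptions the index-based quartiles put the 1.5 limit just below 3307, so the whisker stops at 2613, while a factor of 1.6 raises the limit enough to include 3307. That matches what Allin observed in the plot.
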
