From: Andrea A. <aa...@op...> - 2008-05-19 16:44:59
|
Hi,

I'm having some trouble using the quantile classification algorithm. As you may know, quantile classifies a range of numbers so that each class has the same number of features in it.

Consider a case where an attribute has the following values (in different features): {0 0 0 0 3 5 7 9}. Ask the quantile classifier to create a 4-interval classification, and you'll get:

{0 0}
{0 0}
{3 5}
{7 9}

This does not look very nice... I'm wondering if the quantile algorithm should take this into account and avoid breaking a class when the same value would keep appearing in the next class. For most users the following classification:

att < 3
3 <= att <= 5
5 < att

or, put another way:

{0 0 0 0}
{3 5}
{7 9}

though not made of 4 intervals, would make much more sense.

What I'm wondering is: can we have a quantile function that may return fewer intervals, but that does not build odd classes like the current one?

Cheers
Andrea
|
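The two behaviours discussed in this message can be sketched in Python. This is only an illustration under stated assumptions, not the GeoTools implementation: the function names are hypothetical, and the second function is one possible reading of "do not break a class while the same value keeps appearing".

```python
def equal_count_classes(values, n):
    """Split the sorted values into n consecutive classes of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n)
    classes, start = [], 0
    for i in range(n):
        # The first `extra` classes take one more item when len(data) % n != 0.
        end = start + size + (1 if i < extra else 0)
        classes.append(data[start:end])
        start = end
    return classes


def classes_without_split_runs(values, n):
    """Aim for len(values) / n items per class, but never close a class
    in the middle of a run of equal values (hypothetical variant)."""
    data = sorted(values)
    target = len(data) / n
    classes, current = [], []
    for i, v in enumerate(data):
        current.append(v)
        at_run_end = i + 1 == len(data) or data[i + 1] != v
        if len(current) >= target and at_run_end:
            classes.append(current)
            current = []
    if current:
        # Leftover items that never reached the target size form a last class.
        classes.append(current)
    return classes


sample = [0, 0, 0, 0, 3, 5, 7, 9]
print(equal_count_classes(sample, 4))
print(classes_without_split_runs(sample, 4))
```

For the sample data the first call yields the four classes {0 0} {0 0} {3 5} {7 9}, while the second yields three: {0 0 0 0} {3 5} {7 9}. The variant may return fewer classes than requested, which is the trade-off raised in the message.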
From: Jody G. <jga...@re...> - 2008-05-19 21:08:57
|
What a difficult question; is there a strict definition of the quantile function we could grab from statistics or something?

Given your example I want to ask: what is more important, the number of classes, or the fact that they are "even" in size?

If we go for even in size, you may get 2 categories when you asked for three:

Quantile( {0 0 0 0 3 5 7 9}, 2) ==> {0 0 0 0}, {3 5 7 9}
Quantile( {0 0 0 0 3 5 7 9}, 3) ==> {0 0 0 0}, {3 5 7 9}

This may be a strange case of "what do you expect?" If I am looking at a map or a summary, I want to know what the colors represent; and if I ask the application to color equal quantities of data in different colors, for the data you provided we could only make a map with 2 categories; anything else would be a mistake...

So while I can think of silly ways to break the content up into {0 0} and {0 0}, they are just that: silly.

Jody

Andrea Aime wrote:
> Hi,
> I'm having some troubles using the quantile classification algorithm.
> As you may know, quantile figures out how to classify a range of numbers
> in a way that each class has the same number of features in it.
>
> Consider a case when an attribute has the following values (in different
> features): {0 0 0 0 3 5 7 9}. Then ask the quantile classifier to create
> a 4 intervals classification, and you'll get:
> {0 0}
> {0 0}
> {3 5}
> {7 9}
> This does not look very nice... I'm wondering if the quantile algorithm
> should consider this and avoid breaking the classes when the same
> value will keep on appearing on the next class. For most users the
> following classification:
> att < 3
> 3 <= att <= 5
> 5 < att
> or put another way:
> {0 0 0 0}
> {3 5}
> {7 9}
> though not made of 4 intervals, would make much more sense.
> What I'm wondering is, can we have a quantile function that may return
> fewer intervals but that does not build odd classes like
> the current one?
>
> Cheers
> Andrea
>
> _______________________________________________
> Geotools-devel mailing list
> Geo...@li...
> https://lists.sourceforge.net/lists/listinfo/geotools-devel
|
From: Andrea A. <aa...@op...> - 2008-05-20 08:18:07
|
Jody Garnett wrote:
> What a difficult question; is there a strict definition of the quantile
> function we could grab from statistics or something?

I did not find much, and none of what I found talks about how to handle flat areas in the data histogram:

http://www.gisbanker.com/introduction_part5.htm
http://www.geovista.psu.edu/grants/dg-qg/classing_epi/summary.html
http://www.censusmapper.com/CM_Help/classifyfield.htm
...

> Given your example I want to ask: what is more important; the number of
> classifications, or the fact that they are "even" in size...

If only the number were important, an equal interval classification would have been chosen. Quantile is defined by the "even in size", but given enough flat areas in your data histogram, how do you guess what the even size would be?

The method I suggested guarantees neither the number of intervals nor the equal size; it just avoids the silly interval structure... do you have any suggestion on how to deal with this? What would you do with:

Quantile( {-1 -2 0 0 0 0 3 5 7 9}, 2) ==> ?
Quantile( {-1 -2 0 0 0 0 3 5 7 9}, 3) ==> ?

The method I proposed, that is, detect the flat area in the histogram and avoid breaking the class until you get out of it, would generate the same result for both:

{-1 -2 0 0 0 0}
{3 5 7 9}

For 3 intervals, another not totally silly output could be:

{-1 -2}
{0 0 0 0}
{3 5 7 9}

Generally speaking: detect flat areas and, if they are big enough, make them a class of their own, since they somehow represent an anomaly in the data. Of course, by applying this principle you could get more classes than you asked for. For example:

Quantile( {-10 -9 -2 0 0 0 1 2 4 9 9 9}, 3) ==> what now?

The "don't break if in flat area" approach would generate only 2 classes:

{-10 -9 -2 0 0 0}
{1 2 4 9 9 9}

The "break out flat areas if big enough" approach would generate 4:

{-10 -9 -2}
{0 0 0}
{1 2 4}
{9 9 9}

> If we go for even in size; you may get 2 categories when you asked for
> three
> Quantile( {0 0 0 0 3 5 7 9}, 2) ==> {0 0 0 0}, { 3 5 7 9 }
> Quantile( {0 0 0 0 3 5 7 9}, 3) ==> {0 0 0 0}, { 3 5 7 9 }
>
> This may be a strange case of what do you expect? If I am looking at a
> map or a summary, I want to know what the colors represent; and if I ask
> the application to color equal quantities of data in different colors;
> for the data you provided we could only make a map with 2 categories;
> anything else would be a mistake ...
>
> So while I can think of silly ways to break the content up into {0 0}
> and {0 0} - they are just that - silly.

Yeah, silly. Unfortunately that's exactly what you're getting today out of the quantile classification. I have cases, with real data, where the current function generates 3 consecutive intervals at 0.

Cheers
Andrea
|
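The "break out flat areas if big enough" idea from this message can be sketched as follows. This is a rough Python illustration, not code from GeoTools or GeoServer: the function name is hypothetical, and `min_run` (how long a run of equal values must be to count as "big enough") is an assumed parameter that the caller has to choose.

```python
from itertools import groupby


def classes_with_flat_areas(values, n, min_run):
    """Give every run of at least `min_run` equal values its own class and
    fill the remaining classes up to roughly len(values) / n items each."""
    data = sorted(values)
    target = len(data) / n
    classes, current = [], []
    for _, group in groupby(data):
        run = list(group)
        if len(run) >= min_run:
            # A flat area: close the class being built, then emit the run alone.
            if current:
                classes.append(current)
                current = []
            classes.append(run)
            continue
        current.extend(run)
        if len(current) >= target:
            classes.append(current)
            current = []
    if current:
        classes.append(current)
    return classes


print(classes_with_flat_areas([-10, -9, -2, 0, 0, 0, 1, 2, 4, 9, 9, 9], 3, min_run=3))
print(classes_with_flat_areas([-1, -2, 0, 0, 0, 0, 3, 5, 7, 9], 3, min_run=2))
```

The first call returns four classes although three were requested, {-10 -9 -2} {0 0 0} {1 2 4} {9 9 9}, and the second returns {-2 -1} {0 0 0 0} {3 5 7 9}, matching the outputs listed in the message (with values sorted). The class count is therefore a result of the data, not a guarantee.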
From: Adrian C. <ac...@gm...> - 2008-05-20 09:02:29
|
Hey all,

Wherein we discover that stats are hard, even for the simple questions...

On Tue, 2008-05-20 at 10:18 +0200, Andrea Aime wrote:
> Jody Garnett wrote:
> > What a difficult question; is there a strict definition of the quantile
> > function we could grab from statistics or something?

I'm not sure the use of "Quantile" for this function is correct terminology, but I don't have time to explore it rigorously. So far all I've learned is that I've now forgotten how to use R.

As ever, Wikipedia is our friend these days:

    By a quantile, we mean the fraction (or percent) of points below
    the given value. That is, the 0.3 (or 30%) quantile is the point
    at which 30% percent of the data fall below and 70% fall above
    that value.

Since the key footnote points us to R, we can start to trust this as an authoritative source.

http://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

In R, it seems you want a type=3 method of quantification
    "Type 3 SAS definition: nearest even order statistic"
but, again, I don't have the time to answer this rigorously today.

> Quantile( {-1 -2 0 0 0 0 3 5 7 9}, 2) ==> ?
> Quantile( {-1 -2 0 0 0 0 3 5 7 9}, 3) ==> ?

    eratosthenes:~> R
    ...
    > x <- c(-1,-2,0,0,0,0,3,5,7,9)
    > n <- 2
    > quantile(x,probs=seq(0,1,1/n))
      0%  50% 100%
      -2    0    9
    > n <- 3
    > quantile(x,probs=seq(0,1,1/n))
           0% 33.33333% 66.66667%      100%
           -2         0         3         9

with the value shown being the rightmost in the original vector and defining the breaks which can be applied to the vector to yield the resulting classes. (You don't care about the leftmost value.)

> Quantile( {-10 -9 -2 0 0 0 1 2 4 9 9 9}, 3) ==> what now?

    > x2 <- c(-10,-9,-2,0,0,0,1,2,4,9,9,9)
    > n <- 3
    > quantile(x2,probs=seq(0,1,1/n))
            0%  33.33333%  66.66667%       100%
    -10.000000   0.000000   2.666667   9.000000
    > quantile(x2,probs=seq(0,1,1/n),type=3)
           0% 33.33333% 66.66667%      100%
          -10         0         2         9

Also, you might look at the spreadsheet function definitions, since they might explain the terminology needed.

--adrian
|
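For readers without R at hand, the interior break values of the session above can be reproduced with Python's standard library, and then applied as class boundaries. This is a sketch under assumptions: `statistics.quantiles` with `method="inclusive"` gives the same numbers as the default `quantile()` calls shown above for these inputs, and the helper names and the right-closed rule (lower < v <= upper) are choices made here for illustration.

```python
from bisect import bisect_left
from statistics import quantiles


def quantile_breaks(values, n):
    """Interior cut points for n classes (the 0% and 100% ends are omitted)."""
    # method="inclusive" interpolates linearly between order statistics.
    return quantiles(values, n=n, method="inclusive")


def classify(values, breaks):
    """Group values into right-closed classes: lower < v <= upper."""
    classes = [[] for _ in range(len(breaks) + 1)]
    for v in sorted(values):
        # bisect_left sends a value equal to a break into the lower class.
        classes[bisect_left(breaks, v)].append(v)
    return classes


x2 = [-10, -9, -2, 0, 0, 0, 1, 2, 4, 9, 9, 9]
breaks = quantile_breaks(x2, 3)
print(breaks)
print(classify(x2, breaks))
```

The breaks come out as 0 and about 2.667. Applied as value thresholds, they put all three zeros in the first class, so the classes hold 6, 2 and 4 items instead of 4 each: with tied values, the break values alone cannot produce equal-sized classes.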
From: Andrea A. <aa...@op...> - 2008-05-20 17:34:10
|
Adrian Custer wrote:
> Hey all,
>
> Wherein we discover that stats are hard, even for the simple
> questions...
>
> On Tue, 2008-05-20 at 10:18 +0200, Andrea Aime wrote:
>> Jody Garnett wrote:
>>> What a difficult question; is there a strict definition of the quantile
>>> function we could grab from statistics or something?
>
> I'm not sure the use of "Quantile" for this function is correct
> terminology but don't have time to explore it rigorously. So far all
> I've learned is that I've now forgotten how to use R.
>
> As ever, wikipedia is our friend these days:
>     By a quantile, we mean the fraction (or percent) of points below
>     the given value. That is, the 0.3 (or 30%) quantile is the point
>     at which 30% percent of the data fall below and 70% fall above
>     that value.

Right, but that is not a good definition for what the so-called quantile classification aims at, namely generating a set of rules to paint a map, in the case I'm trying to handle: when a large part of the data has the same value.

> Since the key footnote points us to R, we can start to trust this as an
> authoritative source.
>
> http://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
>
> In R, it seems you want a type=3 method of quantification
>     "Type 3 SAS definition: nearest even order statistic"
> but, again, I don't have the time to answer this rigorously today.
>
>> Quantile( {-1 -2 0 0 0 0 3 5 7 9}, 2) ==> ?
>> Quantile( {-1 -2 0 0 0 0 3 5 7 9}, 3) ==> ?
>
>     eratosthenes:~> R
>     ...
>     > x <- c(-1,-2,0,0,0,0,3,5,7,9)
>     > n <- 2
>     > quantile(x,probs=seq(0,1,1/n))
>       0%  50% 100%
>       -2    0    9
>     > n <- 3
>     > quantile(x,probs=seq(0,1,1/n))
>            0% 33.33333% 66.66667%      100%
>            -2         0         3         9
>
> with the value shown being the rightmost in the original vector and
> defining the breaks which can be applied to the vector to yield the
> resulting classes. (You don't care about the leftmost value).
>
>> Quantile( {-10 -9 -2 0 0 0 1 2 4 9 9 9}, 3) ==> what now?
>
>     > x2 <- c(-10,-9,-2,0,0,0,1,2,4,9,9,9)
>     > n <- 3
>     > quantile(x2,probs=seq(0,1,1/n))
>             0%  33.33333%  66.66667%       100%
>     -10.000000   0.000000   2.666667   9.000000
>     > quantile(x2,probs=seq(0,1,1/n),type=3)
>            0% 33.33333% 66.66667%      100%
>           -10         0         2         9

Again, not very useful... it's telling you that at the 33% break there is a 0, and by applying it you'd get a class that ends with 0 and another that starts with 0. That is something the layman using the application does not understand; it does not make sense to him.

That's why I was suggesting to have the classes avoid breaks on flat areas... so I'm back at square one: the current method is mathematically sound, but does not make any sense to the normal user. What now?

Cheers
Andrea
|
From: Andrea A. <aa...@op...> - 2008-05-22 10:24:32
|
Andrea Aime wrote:
> Andrea Aime wrote:
> ...
>> Again, not very useful... it's telling you that at the 33% break there
>> is a 0, and by applying it, you'd get a class that ends with 0, and
>> another that starts with 0. Which is something the layman using
>> the application does not understand, it does not make sense to him.
>>
>> That's why I was suggesting to have the classes avoid breaks on
>> flat areas.... so I'm back at square one... current method is
>> mathematically sound, but does not make any sense to the normal
>> user. What now?
>
> Well, since I have a customer that needs this, and there seems
> to be no agreement (or a lack of interest) on what to do,
> I'll roll a custom variant of the quantile algorithm inside
> GeoServer that does what I suggested: set apart the flat areas
> of the histogram in their own classes when they are big enough (say,
> half of a standard sized class?), and try to build classes
> with the expected size for the rest of the values.

I also illustrate a possible post-processing approach at http://www.nabble.com/sldService-patches-td17401397.html that would avoid messing with the classification functions altogether.

Cheers
Andrea
|
From: Jody G. <jga...@re...> - 2008-05-22 21:34:46
|
That is also an option, Andrea; any chance we can come up with unique names for these two ideas? I.e. implement and document them as separate functions...

> The first thing I need, is to change the generated rules so that they
> all use closed intervals, such as:
> 0 <= x <= 10
> 10 < x <= 20
> 20 < x <= 30
> as opposed to today's result:
> x <= 10
> 10 < x <= 20
> x > 20

I know the SLD generation code makes use of an SLD "else" clause to catch data that falls off both ends; given the existing categorization function, can you not just handle your SLD generation differently based on the existing ranges? I.e. these categories are something you process into a set of SLD rules; you may find the code already available in cbrewer.

We can also look at the SE 1.1 docs to see how they break down the result of their categorization function.

Jody
|
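The closed-interval rules quoted in this message can be derived mechanically from a list of break values. The following Python sketch only produces rule labels as text; it is not the SLD generation code, and the function name and the `name` parameter are hypothetical.

```python
def interval_rules(breaks, name="x"):
    """Build one rule per pair of consecutive breaks. The first interval is
    closed on both ends; the others exclude their lower bound."""
    rules = []
    for i, (lo, hi) in enumerate(zip(breaks, breaks[1:])):
        lower_op = "<=" if i == 0 else "<"
        rules.append(f"{lo} {lower_op} {name} <= {hi}")
    return rules


for rule in interval_rules([0, 10, 20, 30]):
    print(rule)
```

With the breaks 0, 10, 20, 30 this prints the three rules from the quoted text. Because the minimum and maximum of the data are included as the first and last breaks, no open-ended rule such as `x > 20` is needed.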
From: Jody G. <jga...@re...> - 2008-05-20 16:27:12
|
Andrea Aime wrote:
> Jody Garnett wrote:
>> What a difficult question; is there a strict definition of the
>> quantile function we could grab from statistics or something?
>
> I did not find much, and none of what I've found talks about how to
> handle flat areas in the data histogram:
> http://www.gisbanker.com/introduction_part5.htm
> http://www.geovista.psu.edu/grants/dg-qg/classing_epi/summary.html
> http://www.censusmapper.com/CM_Help/classifyfield.htm
> ...

Your first link there hits the nail on the head:

> Quantiles are best suited for data that is linearly distributed; in
> other words, data that does not have disproportionate numbers of
> features with similar values.

So basically user beware...

Jody
|
From: Jody G. <jga...@re...> - 2008-05-20 16:32:27
|
Andrea Aime wrote:
> Yeah, silly. Unfortunately that's exactly what you're getting today
> out of the quantile classification. I have cases, with real
> data, where the current function generates 3 consecutive intervals at 0.

Okay, since the stats are not helping us that much, let me share with you my silly idea...

We are only trying to make something where, statistically, each bucket has the same chance of catching some data. So in cases where you have two buckets on the same data range...

Quantile( {0 0 0 0 3 5 7 9}, 4) ==> {0 0}, {0 0}, {3 5}, {7 9}

we need to get a bit silly; in a normal stats program I would start throwing a "0" entry into bucket one or two based on a bit of natural randomness (i.e. who cares, as long as they each hold the same number of features at the end of the day).

And as a user this would make sense when I saw it in a legend:

0-0 Category A
0-0 Category B
3-5 Category C
7-9 Category D

We have made it really obvious to the user that there is a "flat" area; they can even see it in their legend...

So how can we make this happen in the real world? Take the feature hashCode; look at the fourth-from-last bit (somewhere in the middle is usually better at being random) and if it is false place the feature in Category A, if it is true place it in Category B.

What do you think?

Jody
|
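The hash-bit idea from this message could look roughly like the Python sketch below. Everything specific here is an assumption made for illustration: the message refers to the Java feature `hashCode`, whereas this sketch uses a CRC32 of a feature id string as a stand-in stable hash, and the bit index (3, counting from the least significant bit) and the function names are hypothetical.

```python
import zlib


def tie_break_bucket(feature_id, bit=3):
    """Return 0 or 1, taken from one bit of a stable hash of the feature id."""
    digest = zlib.crc32(feature_id.encode("utf-8"))
    return (digest >> bit) & 1


def split_tied_features(feature_ids, bit=3):
    """Distribute features that share the same value over two categories."""
    category_a, category_b = [], []
    for fid in feature_ids:
        if tie_break_bucket(fid, bit):
            category_b.append(fid)
        else:
            category_a.append(fid)
    return category_a, category_b


a, b = split_tied_features(["roads.1", "roads.2", "roads.3", "roads.4"])
print(a, b)
```

The assignment is repeatable for a given feature id, but nothing forces the two categories to end up the same size; a hash bit only makes an even split likely over many features.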
From: Andrea A. <aa...@op...> - 2008-05-22 10:03:26
|
Andrea Aime wrote:
...
> Again, not very useful... it's telling you that at the 33% break there
> is a 0, and by applying it, you'd get a class that ends with 0, and
> another that starts with 0. Which is something the layman using
> the application does not understand, it does not make sense to him.
>
> That's why I was suggesting to have the classes avoid breaks on
> flat areas.... so I'm back at square one... current method is
> mathematically sound, but does not make any sense to the normal
> user. What now?

Well, since I have a customer that needs this, and there seems to be no agreement (or a lack of interest) on what to do, I'll roll a custom variant of the quantile algorithm inside GeoServer that does what I suggested: set apart the flat areas of the histogram in their own classes when they are big enough (say, half of a standard sized class?), and try to build classes with the expected size for the rest of the values.

I'd like to avoid that, but I see no way to do so without introducing extra parameters that would break all existing callers of the current quantile function...

Cheers
Andrea
|
From: Jody G. <jga...@re...> - 2008-05-22 21:28:49
|
Andrea Aime wrote:
> Well, since I have a customer that needs this, and there seems to be
> no agreement (or lack of interest) on what to do, I'll roll a custom
> variant of the quantile algorithm inside GeoServer that does what I
> suggested, set apart the flat areas of the histogram in their own
> classes when they are big enough (say, half of a standard sized
> class?), and try to build classes with the expected size for the rest
> of the values.

That sounds fine; why not just update the GeoTools implementation? I.e. we found something that is a mistake (two categories with the same (0,0) range); can we fix it?

> I'd like to avoid that, but I see no way to do so without introducing
> extra parameters that would break all existing callers of the current
> quantile function...

So far you have some style generation code, and it probably would die on the case you describe anyway... if I cared to have specific functionality I would have provided test cases. Let's rock and roll...

Cheers,
Jody
|
From: Andrea A. <aa...@op...> - 2008-05-23 07:18:02
|
Jody Garnett wrote:
> Andrea Aime wrote:
>> Well, since I have a customer that needs this, and there seems to be
>> no agreement (or lack of interest) on what to do, I'll roll a custom
>> variant of the quantile algorithm inside GeoServer that does what I
>> suggested, set apart the flat areas of the histogram in their own
>> classes when they are big enough (say, half of a standard sized
>> class?), and try to build classes with the expected size for the rest
>> of the values.
> That sounds fine; why not just update the geotools implementation? ie we
> found something that is a mistake (two categories with the same (0,0)
> range) can we fix it?

Because in the last mail you proposed fixes that I could not see how to implement in a general way when several areas of the data histogram are flat (you suggested either keeping the intervals equal in size and changing the number of intervals, or forcing the number of intervals and accepting that the number of items per interval may not be equal).

>> I'd like to avoid that, but I see no way to do so without introducing
>> extra parameters that would break all existing callers of the current
>> quantile function...
> So far you have some style generation code; and it probably would die on
> the case you describe anyways... if I cared to have specific
> functionality I would have provided test cases. Lets rock and roll...

Sorry, this was meant to be an hour-long fix and it has already taken me an afternoon in mail exchanges alone; time out for me. I worked around it in the style generation code on the GeoServer side.

Cheers
Andrea
|