From: Jody G. <jga...@re...> - 2008-05-20 16:32:27
Andrea Aime wrote:
> Yeah, silly. Unfortunately that's exactly what you're getting today
> out of the quantile classification simple. I have cases, with real
> data, where the current function generates 3 subsequent intervals at 0.

Okay, since the stats are not helping us that much, let me share my silly idea...

We are only trying to make something where, statistically, each bucket has the same chance of catching some data. So in cases where you have two buckets on the same data range...

  Quantile( {0 0 0 0 3 5 7 9}, 4) ==> {0 0}, {0 0}, {3 5}, {7 9}

We need to get a bit silly; in a normal stats program I would start throwing a "0" entry into bucket one or two based on a bit of natural randomness (i.e. who cares, as long as they each hold the same number of features at the end of the day).

And as a user this would make sense when I saw it in a legend:

  0-0 Category A
  0-0 Category B
  3-5 Category C
  7-9 Category D

We have made it really obvious to the user that there is a "flat" area; they can even see it in their legend...

So how can we make this happen in the real world? Take the feature hashCode and look at the fourth-to-last bit (i.e. somewhere in the middle is usually better at being random): if it is false, place the feature in Category A; if it is true, place it in Category B.

What do you think?

Jody
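
A rough sketch of the idea above, in Python rather than the project's own Java, just to make the arithmetic concrete. Everything named here is an assumption for illustration: `equal_count_buckets`, `legend_labels` and `tie_break_bit` are hypothetical helpers, the feature id "roads.1" is made up, and `zlib.crc32` stands in for the feature hashCode so the result is repeatable between runs.

```python
import zlib


def equal_count_buckets(values, k):
    """Split the sorted values into k buckets holding (nearly) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    # Bucket i takes the slice [i*n//k, (i+1)*n//k) of the sorted values.
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def legend_labels(buckets):
    """Build "min-max" legend labels, skipping any empty bucket."""
    return [f"{b[0]}-{b[-1]}" for b in buckets if b]


def tie_break_bit(feature_id, bit=3):
    """Return one bit (0 or 1) of a stable hash of the feature id.

    Assumption: crc32 replaces the Java hashCode; bit=3 is the
    fourth bit counted from the least significant end.
    """
    return (zlib.crc32(feature_id.encode("utf-8")) >> bit) & 1


def pick_tied_bucket(feature_id, first, second):
    """Choose between two buckets that cover the same value range."""
    return second if tie_break_bit(feature_id) else first


if __name__ == "__main__":
    buckets = equal_count_buckets([0, 0, 0, 0, 3, 5, 7, 9], 4)
    print(buckets)  # [[0, 0], [0, 0], [3, 5], [7, 9]]
    print(legend_labels(buckets))
    print(pick_tied_bucket("roads.1", "Category A", "Category B"))
```

One caveat with this sketch: a hash bit only balances the two tied buckets on average, so the counts in Category A and Category B are not guaranteed to come out exactly equal.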