RE: [Classifier4j-devel] Fwd: calculateOverallProbability Questions
From: Nick L. <nl...@es...> - 2003-11-18 04:35:02
> > In my word_probability database, I currently have no
> > "nonMatchingCount"s.

Yes, that will cause problems!

> Therefore all my word probabilities are turning out as LOWER_BOUND,
> NEUTRAL_PROBABILITY, or .99, since effectively
> matchingCount/matchingCount = 1. BayesianClassifier.normaliseSignificance()
> presumably adjusts this outcome from 1 to .99.
>
> This, I believe, represents a major difference between the current method
> and my understanding of POPFile's method.
>
> At this point, POPFile is calculating:
>
> Occurrences of Word A in Category XYZ / Total Occurrences of ALL words in
> Category XYZ.
>
> In other words:
>
> match_count of A where Category=XYZ / sum(match_count) from Category XYZ.

Classifier4J calculates the probability of a word matching as:

match_count / (match_count + non_match_count)

I guess the difference between the two methods is quite important. I'm trying
to analyse what it means and which is more useful.

Consider the following case (based on my actual database of words): I want to
analyse the sentence "Apache Jakarta is a Java Site" to see if it matches my
"I would probably be interested in this" criteria. I am expecting that it will
match.
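The two per-word formulas being compared above can be sketched in Java as
follows. The class and method names here are illustrative only, not
Classifier4J's or POPFile's actual API, and the 0.5 neutral value for unseen
words is an assumption:

```java
// Hypothetical sketch of the two per-word probability formulas discussed
// above. Not the real Classifier4J implementation.
public class WordProbabilityComparison {

    /** Classifier4J style: matches / (matches + non-matches). */
    static double classifier4jProbability(int matchingCount, int nonMatchingCount) {
        if (matchingCount + nonMatchingCount == 0) {
            return 0.5; // assumed neutral probability for an unseen word
        }
        return (double) matchingCount / (matchingCount + nonMatchingCount);
    }

    /**
     * POPFile style: occurrences of the word in the category divided by
     * the total occurrences of ALL words in that category.
     */
    static double popfileProbability(int matchCountInCategory, long totalWordsInCategory) {
        return (double) matchCountInCategory / totalWordsInCategory;
    }

    public static void main(String[] args) {
        // "Apache" from the example below: M=16, NM=2, category total=15359
        System.out.println(classifier4jProbability(16, 2));  // ~0.8889
        System.out.println(popfileProbability(16, 15359));   // ~0.00104
    }
}
```

Note how the same word count gives a probability near 1 under one formula and
near 0 under the other, which is exactly the divergence worked through below.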
// we need to calculate xy/(xy + z)
// where z = (1-x)(1-y)

Total of select sum(match_count) from word_probability = 15359

Apache:  M=16, NM=2,  C4J-P=0.8889                PF-P=0.001
Jakarta: M=16, NM=0,  C4J-P=0.99 (using cut-off)  PF-P=0.001
is   = stop word
a    = stop word
Java:    M=98, NM=13, C4J-P=0.6805                PF-P=0.0064
Site:    M=7,  NM=4,  C4J-P=0.6364                PF-P=0.0005

For Classifier4J, the calculation goes:

(0.8889)(0.99)(0.6805)(0.6364) /
  ((0.8889)(0.99)(0.6805)(0.6364)
   + (1 - 0.8889)(1 - 0.99)(1 - 0.6805)(1 - 0.6364))
= 0.3811065397722 / (0.3811065397722 + (0.1111)(0.01)(0.3195)(0.3636))
= 0.3811065397722 / (0.3811065397722 + 0.0001290650922)
= 0.3811065397722 / 0.3812356048644
= 0.9996

For POPFile:

(0.001)(0.001)(0.0064)(0.0005) /
  ((0.001)(0.001)(0.0064)(0.0005)
   + (1 - 0.001)(1 - 0.001)(1 - 0.0064)(1 - 0.0005))
= 0.0000000000032 / (0.0000000000032 + (0.999)(0.999)(0.9936)(0.9995))
= 0.0000000000032 / (0.0000000000032 + 0.9911179867032)
= 0.0000000000032 / 0.9911179867064
= pretty close to zero

Now I realise they do their stuff with logs to get around this, but I don't
really think you can call that Bayesian. Bayes' theorem looks like:
<http://www.paulgraham.com/naivebayes.html>

> This is my interpretation of the method discussed at:
> http://sourceforge.net/docman/display_doc.php?docid=13334&group_id=63137
>
> Have I overlooked something, or is this just a difference between the two
> calculations?

I don't think you've overlooked anything.
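As a postscript: the combination formula worked through above, and the kind of
log-space rearrangement that avoids the underflow the POPFile numbers show, can
be sketched as follows. This is illustrative code under my own naming, not
either project's actual implementation:

```java
// Sketch of the Graham-style combined probability
//   p1*p2*...*pn / (p1*...*pn + (1-p1)*...*(1-pn))
// computed directly, and the same value computed in log space.
public class CombinedProbability {

    /** Direct product form; underflows to 0/0 for many small p's. */
    static double combineDirect(double[] probs) {
        double num = 1.0, inv = 1.0;
        for (double p : probs) {
            num *= p;
            inv *= (1.0 - p);
        }
        return num / (num + inv);
    }

    /**
     * Same formula rearranged: with S = sum(log(1-p) - log(p)),
     * the result is 1 / (1 + e^S). The products never underflow
     * because only their logarithms are accumulated.
     */
    static double combineLog(double[] probs) {
        double s = 0.0;
        for (double p : probs) {
            s += Math.log(1.0 - p) - Math.log(p);
        }
        return 1.0 / (1.0 + Math.exp(s));
    }

    public static void main(String[] args) {
        // The Classifier4J figures from the example above.
        double[] c4j = {0.8889, 0.99, 0.6805, 0.6364};
        System.out.println(combineDirect(c4j)); // ~0.9996
        System.out.println(combineLog(c4j));    // same value, via logs
    }
}
```

The log form is algebraically identical to the direct form, which is the point
of contention above: working in logs changes the arithmetic, not the formula.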