RE: [Classifier4j-devel] Fwd: calculateOverallProbability Questions
From: Nick L. <nl...@es...> - 2003-11-18 04:35:02
> > In my word_probability database, I currently have no
> > "nonMatchingCount"s.

Yes, that will cause problems!

> Therefore all my word probabilities are turning out as LOWER_BOUND,
> NEUTRAL_PROBABILITY, or .99, since effectively
> matchingCount/matchingCount = 1. BayesianClassifier.normaliseSignificance()
> presumably adjusts this outcome from 1 to .99.
>
> This, I believe, represents a major difference between the current method
> and my understanding of POPFile's method.
>
> At this point, POPFile is calculating:
>
> Occurrences of Word A in Category XYZ / Total Occurrences of ALL words in
> Category XYZ.
>
> In other words:
>
> match_count of A where Category=XYZ / sum(match_count) from Category XYZ.

Classifier4J calculates the probability of a word matching as:

match_count / (match_count + non_match_count)

I guess the difference between the two methods is quite important. I'm trying
to analyse what it means and which is more useful.

Consider the following case (based on my actual database of words): I want to
analyse the sentence "Apache Jakarta is a Java Site" to see if it matches my
"I would probably be interested in this" criteria. I am expecting that it will
match.
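The two per-word formulas being compared above can be sketched in Java as
follows. The class and method names here are illustrative only, not
Classifier4J's or POPFile's actual API, and the 0.5 neutral value for unseen
words is an assumption:

```java
// Hypothetical sketch of the two per-word probability formulas discussed
// above. Not the real Classifier4J implementation.
public class WordProbabilityComparison {

    /** Classifier4J style: matches / (matches + non-matches). */
    static double classifier4jProbability(int matchingCount, int nonMatchingCount) {
        if (matchingCount + nonMatchingCount == 0) {
            return 0.5; // assumed neutral probability for an unseen word
        }
        return (double) matchingCount / (matchingCount + nonMatchingCount);
    }

    /**
     * POPFile style: occurrences of the word in the category divided by
     * the total occurrences of ALL words in that category.
     */
    static double popfileProbability(int matchCountInCategory, long totalWordsInCategory) {
        return (double) matchCountInCategory / totalWordsInCategory;
    }

    public static void main(String[] args) {
        // "Apache" from the example below: M=16, NM=2, category total=15359
        System.out.println(classifier4jProbability(16, 2));  // ~0.8889
        System.out.println(popfileProbability(16, 15359));   // ~0.00104
    }
}
```

Note how the same word count gives a probability near 1 under one formula and
near 0 under the other, which is exactly the divergence worked through below.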
// we need to calculate xy/(xy + z)
// where z = (1-x)(1-y)

Total of select sum(match_count) from word_probability = 15359

Apache:  M=16, NM=2,  C4J-P=0.8889                PF-P=0.001
Jakarta: M=16, NM=0,  C4J-P=0.99 (using cut-off)  PF-P=0.001
is   = stop word
a    = stop word
Java:    M=98, NM=13, C4J-P=0.6805                PF-P=0.0064
Site:    M=7,  NM=4,  C4J-P=0.6364                PF-P=0.0005

For Classifier4J, the calculation goes:

(0.8889)(0.99)(0.6805)(0.6364) /
  ((0.8889)(0.99)(0.6805)(0.6364)
   + (1 - 0.8889)(1 - 0.99)(1 - 0.6805)(1 - 0.6364))
= 0.3811065397722 / (0.3811065397722 + (0.1111)(0.01)(0.3195)(0.3636))
= 0.3811065397722 / (0.3811065397722 + 0.0001290650922)
= 0.3811065397722 / 0.3812356048644
= 0.9996

For POPFile:

(0.001)(0.001)(0.0064)(0.0005) /
  ((0.001)(0.001)(0.0064)(0.0005)
   + (1 - 0.001)(1 - 0.001)(1 - 0.0064)(1 - 0.0005))
= 0.0000000000032 / (0.0000000000032 + (0.999)(0.999)(0.9936)(0.9995))
= 0.0000000000032 / (0.0000000000032 + 0.9911179867032)
= 0.0000000000032 / 0.9911179867064
= pretty close to zero

Now I realise they do their stuff with logs to get around this, but I don't
really think you can call that Bayesian. Bayes' theorem looks like:
<http://www.paulgraham.com/naivebayes.html>

> This is my interpretation of the method discussed at:
> http://sourceforge.net/docman/display_doc.php?docid=13334&group_id=63137
>
> Have I overlooked something, or is this just a difference between the two
> calculations?

I don't think you've overlooked anything.
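As a postscript: the combination formula worked through above, and the kind of
log-space rearrangement that avoids the underflow the POPFile numbers show, can
be sketched as follows. This is illustrative code under my own naming, not
either project's actual implementation:

```java
// Sketch of the Graham-style combined probability
//   p1*p2*...*pn / (p1*...*pn + (1-p1)*...*(1-pn))
// computed directly, and the same value computed in log space.
public class CombinedProbability {

    /** Direct product form; underflows to 0/0 for many small p's. */
    static double combineDirect(double[] probs) {
        double num = 1.0, inv = 1.0;
        for (double p : probs) {
            num *= p;
            inv *= (1.0 - p);
        }
        return num / (num + inv);
    }

    /**
     * Same formula rearranged: with S = sum(log(1-p) - log(p)),
     * the result is 1 / (1 + e^S). The products never underflow
     * because only their logarithms are accumulated.
     */
    static double combineLog(double[] probs) {
        double s = 0.0;
        for (double p : probs) {
            s += Math.log(1.0 - p) - Math.log(p);
        }
        return 1.0 / (1.0 + Math.exp(s));
    }

    public static void main(String[] args) {
        // The Classifier4J figures from the example above.
        double[] c4j = {0.8889, 0.99, 0.6805, 0.6364};
        System.out.println(combineDirect(c4j)); // ~0.9996
        System.out.println(combineLog(c4j));    // same value, via logs
    }
}
```

The log form is algebraically identical to the direct form, which is the point
of contention above: working in logs changes the arithmetic, not the formula.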