From: Sami D. <sk...@fr...> - 2005-01-28 21:08:13
Hi,

I'm currently writing an SBPH/Markovian mail filter in C# (https://savannah.nongnu.org/projects/mailprobe/). Since I'm partly (mostly?) taking my ideas from your project, I'm wondering about your implementation. Here's my problem.

Let's say that Bayes' formula is:

P(in class | feat) = [P(feat | in class) * P(in class)] / [P(feat | in class) * P(in class) + P(feat | not class) * P(not class)]

Using Markovian matching, we get local probabilities of the form:

Plocal-spam = 0.5 + [(Nspam - Nnonspam) * Weight] / [C1 * (Nspam + Nnonspam + C2) * WeightMax]

=> Do you actually use this formula? By the way, does anyone know where this formula comes from?

First questions:

1) Plocal-spam is equivalent to the P(feat | in class) part of the Bayesian rule, right?

2) I'd love it if you could help me understand the following parts of your classify_details.txt:

"We start assuming that the chance of spam is 50/50"

=> This means we set P(in class) = 0.5 in the Bayesian rule, right? And, for N classes, do we start the chance of each class at 1/N?

"We count up the total number of features in the good versus evil feature .css files. We use these counts to normalize the chances of good versus evil features, so if your training sets are mostly good, it doesn't predispose the filter to think that everything is good."

=> Let's say the total feature count in the "good" class is 10 and the total feature count in the "evil" class is 20, so we have 30 features in total. What exactly do you mean by "normalize"? In Bayes' formula, here is what we need to calculate:

a) P(feat | in class)
b) P(in class)
c) P(feat | not class)
d) P(not class)

What exactly do you need to "normalize"? P(feat | in class) and P(feat | not class) don't depend on the other classes.
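For concreteness, here is how I currently compute that local probability (a Python sketch of the formula above; the values I pass for C1, C2 and the weights are placeholders I made up, not constants taken from CRM114 -- that's exactly what I'm asking about):

```python
# Sketch of the local-probability formula quoted above.
# C1, C2, Weight and WeightMax are placeholder values, NOT
# CRM114's actual constants -- I'm asking which values you use.

def local_prob_spam(n_spam, n_nonspam, weight, weight_max, c1=16.0, c2=1.0):
    """Plocal-spam = 0.5 + [(Nspam - Nnonspam) * Weight]
                         / [C1 * (Nspam + Nnonspam + C2) * WeightMax]"""
    return 0.5 + ((n_spam - n_nonspam) * weight) / (
        c1 * (n_spam + n_nonspam + c2) * weight_max)

# A feature seen only in spam nudges the local probability above 0.5,
# and one seen only in nonspam nudges it below 0.5:
p = local_prob_spam(n_spam=3, n_nonspam=0, weight=4.0, weight_max=256.0)
assert 0.5 < p < 1.0
```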
If a specific feature was counted 4 times in class c, and class c has 100 features in total, then P(feat | in class) = 4/100.

If we instead calculate P(feat | in class) using the formula

Plocal-spam = 0.5 + [(Nspam - Nnonspam) * Weight] / [C1 * (Nspam + Nnonspam + C2) * WeightMax]

then does "normalizing" mean the following: with a total count of 20 for "evil" and 10 for "good", if the database says that Nspam = 2 and Nnonspam = 3 (for a certain feature), do we set Nspam = 2 and Nnonspam = (20/10) * 3 = 6, and then calculate the probability?

As for P(in class) and P(not class): didn't we just say these were 0.5?

"We repeatedly form a feature with the polynomials, check the .css files to see what the counts of that feature are for spam and nonspam, and use the counts to calculate P(A|S) and P(A|NS) [remember, we correct for the fact that we may have different total counts in the spam and nonspam categories]"

=> Same question as before: what about normalization?

"We also bound P(A|S) and P(A|NS) to prevent any 0.0 or 1.0 probabilities from saturating the system"

=> Yes, sure, OK. But I still haven't understood the [1/(featurecount+2), 1 - 1/(featurecount+2)] limit. Where does it come from? Is it empirical? And what about "and then to add further uncertainty to that bound additionally by a factor of 1/(featurecount+1)"? Do you use a random number or something like that?

"Once we have P(A|S) and P(A|NS), we calculate the new P(S) and P(NS). Then we get the next feature out of the polynomial hash pipeline and repeat until we hit the end of the text."

=> OK, so you mean that you set P(S) and P(NS) to the new values, and then recalculate the Bayesian formula for the next feature. I'm just wondering why everybody seems to use a different way of doing things. Usually, people seem to:

- suppress the denominator in the Bayesian rule (supposed to be a constant?
It doesn't seem to be one in your formula.)

- then, to calculate P(features | class) * P(class), compute: P(feature1 | class) * P(feature2 | class) * ... * P(featureN | class) * P(class)

This way of calculating and yours are different, so I'm really wondering which one is best ;p

3) About TOE: I'm wondering about what you call TOE. Do you mean that the user only has to train the filter on errors, or that CRM114 keeps re-learning misclassified emails until they are classified correctly?

I think that's all. Thank you very much for your help.

Sami Dalouche
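P.S. To make my questions in 2) concrete, here is my current mental model of the per-feature loop as a Python sketch. The count-scaling step is only my guess at what "normalize" means, the clamp is the [1/(featurecount+2), 1 - 1/(featurecount+2)] bound quoted above, and the product form is the "usual" combination I contrasted it with; none of this is code from either project.

```python
# My current mental model of the per-feature classification loop.
# The normalization step is a GUESS at what classify_details.txt
# means; the clamp is the quoted [1/(fc+2), 1 - 1/(fc+2)] bound.

def normalize_counts(n_spam, n_nonspam, total_spam, total_nonspam):
    """Scale one class's counts so unequal training-set sizes don't
    bias the filter (my reading of "normalize")."""
    # e.g. totals 20 vs 10 -> multiply the nonspam counts by 2
    return n_spam, n_nonspam * (total_spam / total_nonspam)

def bound(p, feature_count):
    """Clamp p into [1/(fc+2), 1 - 1/(fc+2)] so no single feature
    can saturate the chain with a 0.0 or 1.0 probability."""
    lo = 1.0 / (feature_count + 2)
    return min(max(p, lo), 1.0 - lo)

def sequential_update(pairs, prior_s=0.5):
    """What I read classify_details.txt as describing: fold each
    feature's (P(A|S), P(A|NS)) into P(S) one at a time, using the
    updated P(S) as the prior for the next feature."""
    p_s = prior_s
    for p_a_s, p_a_ns in pairs:
        p_s = (p_a_s * p_s) / (p_a_s * p_s + p_a_ns * (1.0 - p_s))
    return p_s

def product_form(pairs, prior_s=0.5):
    """The "usual" way: multiply all the P(fi|class) terms first,
    then renormalize once (so the denominator drops out)."""
    num_s, num_ns = prior_s, 1.0 - prior_s
    for p_a_s, p_a_ns in pairs:
        num_s *= p_a_s
        num_ns *= p_a_ns
    return num_s / (num_s + num_ns)

# The example from my question: Nspam=2, Nnonspam=3, totals 20 vs 10
assert normalize_counts(2, 3, 20, 10) == (2, 6.0)

# With exact arithmetic the two combination schemes agree; they only
# diverge once the intermediate probabilities get bounded/clamped.
pairs = [(0.9, 0.2), (0.8, 0.5), (0.3, 0.6)]
assert abs(sequential_update(pairs) - product_form(pairs)) < 1e-12
```

If that algebraic equivalence is right, then the bounding step is the only real difference between your scheme and the "usual" one, which would answer part of my own question.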