From: Sami D. <sk...@fr...> - 2005-01-28 21:08:13
Hi,

I'm currently writing an SBPH/Markovian mail filter in C# (https://savannah.nongnu.org/projects/mailprobe/). Since I'm partly (mostly?) taking my ideas from your project, I'm wondering about your implementation. Here's my problem.

Let's say that Bayes' formula is:

P(in class | feat) = [P(feat | in class) * P(in class)] / [P(feat | in class) * P(in class) + P(feat | not class) * P(not class)]

Using Markovian matching, we get local probabilities of the form:

Plocal-spam = 0.5 + [(Nspam - Nnonspam) * Weight] / [C1 * (Nspam + Nnonspam + C2) * WeightMax]

=> Do you actually use this formula? By the way, does anyone know where this formula comes from?

First questions:

1) Plocal-spam is equivalent to the P(feat | in class) part of the Bayesian rule, right?

2) I'd love it if you could help me understand the following parts of your classify_details.txt:

"We start assuming that the chance of spam is 50/50"

=> This means we set P(in class) = 0.5 in the Bayesian rule, right? And, for N classes, do we start the chance of each class at 1/N?

"We count up the total number of features in the good versus evil feature .css files. We use these counts to normalize the chances of good versus evil features, so if your training sets are mostly good, it doesn't predispose the filter to think that everything is good."

=> Let's say the total feature count in the "good" class is 10 and the total feature count in the "evil" class is 20, so we have 30 features in total. What exactly do you mean by "normalize"? In Bayes' formula, here is what we need to calculate:

a) P(feat | in class)
b) P(in class)
c) P(feat | not class)
d) P(not class)

What exactly do you need to "normalize"? P(feat | in class) and P(feat | not class) don't depend on the other classes.
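For concreteness, here is how I currently compute that local probability (a Python sketch of the formula above; the values I pass for C1, C2 and the weights are placeholders I made up, not constants taken from CRM114 -- that's exactly what I'm asking about):

```python
# Sketch of the local-probability formula quoted above.
# C1, C2, Weight and WeightMax are placeholder values, NOT
# CRM114's actual constants -- I'm asking which values you use.

def local_prob_spam(n_spam, n_nonspam, weight, weight_max, c1=16.0, c2=1.0):
    """Plocal-spam = 0.5 + [(Nspam - Nnonspam) * Weight]
                         / [C1 * (Nspam + Nnonspam + C2) * WeightMax]"""
    return 0.5 + ((n_spam - n_nonspam) * weight) / (
        c1 * (n_spam + n_nonspam + c2) * weight_max)

# A feature seen only in spam nudges the local probability above 0.5,
# and one seen only in nonspam nudges it below 0.5:
p = local_prob_spam(n_spam=3, n_nonspam=0, weight=4.0, weight_max=256.0)
assert 0.5 < p < 1.0
```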
If a specific feature was counted 4 times in class c, and class c has 100 features in total, then P(feat | in class) = 4/100.

If we instead calculate P(feat | in class) using the formula

Plocal-spam = 0.5 + [(Nspam - Nnonspam) * Weight] / [C1 * (Nspam + Nnonspam + C2) * WeightMax]

then does "normalizing" mean the following: with a total count of 20 for "evil" and 10 for "good", if the database says that Nspam = 2 and Nnonspam = 3 (for a certain feature), do we set Nspam = 2 and Nnonspam = (20/10) * 3 = 6, and then calculate the probability?

As for P(in class) and P(not class): didn't we just say these were 0.5?

"We repeatedly form a feature with the polynomials, check the .css files to see what the counts of that feature are for spam and nonspam, and use the counts to calculate P(A|S) and P(A|NS) [remember, we correct for the fact that we may have different total counts in the spam and nonspam categories]"

=> Same question as before: what about normalization?

"We also bound P(A|S) and P(A|NS) to prevent any 0.0 or 1.0 probabilities from saturating the system"

=> Yes, sure, OK. But I still haven't understood the [1/(featurecount+2), 1 - 1/(featurecount+2)] limit. Where does it come from? Is it empirical? And what about "and then to add further uncertainty to that bound additionally by a factor of 1/(featurecount+1)"? Do you use a random number or something like that?

"Once we have P(A|S) and P(A|NS), we calculate the new P(S) and P(NS). Then we get the next feature out of the polynomial hash pipeline and repeat until we hit the end of the text."

=> OK, so you mean that you set P(S) and P(NS) to the new values, and then recalculate the Bayesian formula for the next feature. I'm just wondering why everybody seems to use a different way of doing things. Usually, people seem to:

- suppress the denominator in the Bayesian rule (supposed to be a constant?
It doesn't seem to be one in your formula.)

- then, to calculate P(features | class) * P(class), compute: P(feature1 | class) * P(feature2 | class) * ... * P(featureN | class) * P(class)

This way of calculating and yours are different, so I'm really wondering which one is best ;p

3) About TOE: I'm wondering about what you call TOE. Do you mean that the user only has to train the filter on errors, or that CRM114 keeps re-learning misclassified emails until they are classified correctly?

I think that's all. Thank you very much for your help.

Sami Dalouche
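P.S. To make my questions in 2) concrete, here is my current mental model of the per-feature loop as a Python sketch. The count-scaling step is only my guess at what "normalize" means, the clamp is the [1/(featurecount+2), 1 - 1/(featurecount+2)] bound quoted above, and the product form is the "usual" combination I contrasted it with; none of this is code from either project.

```python
# My current mental model of the per-feature classification loop.
# The normalization step is a GUESS at what classify_details.txt
# means; the clamp is the quoted [1/(fc+2), 1 - 1/(fc+2)] bound.

def normalize_counts(n_spam, n_nonspam, total_spam, total_nonspam):
    """Scale one class's counts so unequal training-set sizes don't
    bias the filter (my reading of "normalize")."""
    # e.g. totals 20 vs 10 -> multiply the nonspam counts by 2
    return n_spam, n_nonspam * (total_spam / total_nonspam)

def bound(p, feature_count):
    """Clamp p into [1/(fc+2), 1 - 1/(fc+2)] so no single feature
    can saturate the chain with a 0.0 or 1.0 probability."""
    lo = 1.0 / (feature_count + 2)
    return min(max(p, lo), 1.0 - lo)

def sequential_update(pairs, prior_s=0.5):
    """What I read classify_details.txt as describing: fold each
    feature's (P(A|S), P(A|NS)) into P(S) one at a time, using the
    updated P(S) as the prior for the next feature."""
    p_s = prior_s
    for p_a_s, p_a_ns in pairs:
        p_s = (p_a_s * p_s) / (p_a_s * p_s + p_a_ns * (1.0 - p_s))
    return p_s

def product_form(pairs, prior_s=0.5):
    """The "usual" way: multiply all the P(fi|class) terms first,
    then renormalize once (so the denominator drops out)."""
    num_s, num_ns = prior_s, 1.0 - prior_s
    for p_a_s, p_a_ns in pairs:
        num_s *= p_a_s
        num_ns *= p_a_ns
    return num_s / (num_s + num_ns)

# The example from my question: Nspam=2, Nnonspam=3, totals 20 vs 10
assert normalize_counts(2, 3, 20, 10) == (2, 6.0)

# With exact arithmetic the two combination schemes agree; they only
# diverge once the intermediate probabilities get bounded/clamped.
pairs = [(0.9, 0.2), (0.8, 0.5), (0.3, 0.6)]
assert abs(sequential_update(pairs) - product_form(pairs)) < 1e-12
```

If that algebraic equivalence is right, then the bounding step is the only real difference between your scheme and the "usual" one, which would answer part of my own question.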