I’m writing my diploma thesis, in which I describe the decision process of POPFile, but I have no idea how the Score is calculated. I have read through the whole forum, but I did not find any example of a Score calculation. I have also read the source code, where I found the following explanation:
# For each word go through the buckets and calculate
# P(word|bucket) and then calculate P(word|bucket) ^ word count
# and multiply to the score
Even so, I am not able to do the calculation by hand or to explain exactly how it is done. Could you please give me an example of a Score calculation? Thank you very much. It is very important for me.
>> I have read through whole forum, but I did not find any example ... <<
There is a separate "documentation" section here:
This "How POPFile does email classification" explanation may help you:
The POPFile project has a new website at http://getpopfile.org
Following these documents, I have already tried to recalculate the score for one email. I chose one word and tried to find out how the frequency, probability and score for this word were calculated. I have 7 buckets. The figures for this word and for the buckets are:
B1:68, B2:5, B3:16, B4:2, B5:9, B6:0, B7:1 (number of times my word W appears in each bucket)
B1:33145, B2:49145, B3:33760, B4:32930, B5:56339, B6:34059, B7:23031 (total number of words in each bucket)
B1:8318, B2:12578, B3:8988, B4:8608, B5:14984, B6:9899, B7:6912 (number of unique words in each bucket)
Number of all unique words: 70287
Number of all words: 262409
The results from POPFile for my word are following:
             B1       B2       B3       B4       B5       B6    B7
Frequency    0,00205  0,0001   0,00047  0,00006  0,00016  none  0,00004
Probability  0,6733   0,0495   0,1584   0,0198   0,0891   none  0,0099
Score        3,7311   2,4265   3,0947   2,2024   2,6224   none  2,0567
I found out that the frequency is calculated as the number of times word W appears in a bucket divided by the total number of words in that bucket. The probability is calculated as the number of times word W appears in a bucket divided by the total number of times word W appears in all buckets. But I have no idea how the score is calculated. Could you please help me with this? It is very important for me. Thank you!
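The two rules described in this post can be checked against the counts quoted above. This is only a minimal sketch in Python: the variable names are made up for illustration and are not taken from the POPFile source, and the probability rule is the one inferred in this post, not one confirmed from the code.

```python
# Counts quoted earlier in this thread, for buckets B1..B7.
word_counts = [68, 5, 16, 2, 9, 0, 1]  # times word W appears in each bucket
bucket_totals = [33145, 49145, 33760, 32930, 56339, 34059, 23031]  # words per bucket

# Frequency: occurrences of W in a bucket / total words in that bucket.
frequencies = [c / t for c, t in zip(word_counts, bucket_totals)]

# Probability (as inferred in the post, an assumption): occurrences of W in
# a bucket / occurrences of W across all buckets.
total_occurrences = sum(word_counts)
probabilities = [c / total_occurrences for c in word_counts]

for i, (f, p) in enumerate(zip(frequencies, probabilities), start=1):
    print(f"B{i}: frequency={f:.5f} probability={p:.4f}")
```

For B6 this sketch prints zeros, where POPFile's own table shows "none".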
The score is calculated with the formula below:
score = (log(frequency) - not_likely) / log(10)
where not_likely = -log(number_of_all_words * 10).
In your case,
score1 = (log(0,00205)+log(262409*10))/log(10) = 3.7307
In other words, using the exact counts instead of the rounded frequency,
score1 = log(68/33145 * 262409*10)/log(10) = 3.7311
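The same formula can be applied to every bucket and compared with the Score row POPFile reported. This is a sketch in Python, not the POPFile code: the function name `score` is hypothetical, and returning `None` for a bucket where the word never appears is an assumption made only to mirror the "none" shown in the table.

```python
import math

# Counts quoted earlier in this thread, for buckets B1..B7.
word_counts = [68, 5, 16, 2, 9, 0, 1]
bucket_totals = [33145, 49145, 33760, 32930, 56339, 34059, 23031]
number_of_all_words = sum(bucket_totals)  # 262409, as stated in the thread

# Natural log of the frequency assigned to unseen words.
not_likely = -math.log(number_of_all_words * 10)

def score(count, bucket_total):
    """Score of the word for one bucket (hypothetical helper)."""
    if count == 0:
        return None  # assumption: mirrors the "none" entry in POPFile's table
    frequency = count / bucket_total
    return (math.log(frequency) - not_likely) / math.log(10)

scores = [score(c, t) for c, t in zip(word_counts, bucket_totals)]
for i, s in enumerate(scores, start=1):
    print(f"B{i}:", "none" if s is None else f"{s:.4f}")
```

With the exact counts, the printed values agree with the Score row quoted above to four decimal places.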
Thank you very much. This is helpful, but I’m not sure where the number 10 comes from, by which I multiply number_of_all_words, or why I then divide by log(10).
The whole point of homework is that people should do it themselves!
The 'not_likely' value is the logarithm of the frequency that POPFile assumes for unseen words in the bucket, and POPFile uses 1/(10*number_of_all_words) as that frequency:
Dividing by log(10) converts the natural logarithm to the common (base-10) logarithm.
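Written out as a derivation, the formula from the earlier reply becomes a single base-10 logarithm. The symbols f (the word's frequency in the bucket) and N (number_of_all_words) are shorthand introduced here, not POPFile names:

```latex
\begin{aligned}
\text{score}
  &= \frac{\ln f - \ln\dfrac{1}{10N}}{\ln 10}
   = \frac{\ln\left(f \cdot 10N\right)}{\ln 10} \\
  &= \log_{10}\left(f \cdot 10N\right)
   = \log_{10} f + \log_{10} N + 1
\end{aligned}
```

Read this way, the score is the base-10 logarithm of how many times more frequent the word is in the bucket than an unseen word is assumed to be, and the factor 10 contributes the final "+ 1".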
I am sorry, but I spent several hours trying to recalculate these numbers according to the documentation and Bayes' theorem, but it is not clear that the calculation is done as is described by Naoki. So I just want to know how the calculation is done and why.
>> it is not clear that the calculation is done as is described by Naoki <<
Naoki is one of the POPFile developers so I reckon he knows what he's talking about.
Are you looking at the _current_ code?
The current code is _not_ on SourceForge, it is at the new site (http://getpopfile.org).
POPFile 1.1.0 is the current release and you can browse the source for the POPFile engine at http://getpopfile.org/browser/tags/v1_1_0/engine
1. The first place to start reading is "How POPFile does email classification" which you can find here: http://sourceforge.net/docman/display_doc.php?docid=13334&group_id=63137
2. Then read "Bayes Theorem and Logarithms" which you can find here: https://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137
Those two documents give a general overview of the POPFile algorithm.
3. Take a look at the source code for Classifier::Bayes which you'll find here: http://getpopfile.org/browser/trunk/engine/Classifier/Bayes.pm
4. Perhaps read the DDJ article "Naive Bayes Text Classification": http://www.ddj.com/development-tools/184406064
5. The only part these documents do not describe is how POPFile handles the probability of a word that it has not previously seen. This is described in the FAQ: http://getpopfile.org/docs/faq:newwords