2009-04-04
2013-04-15
• Richard Cervenka - 2009-04-04

I’m writing my diploma thesis, in which I describe the decision process of POPFile, but I have no idea how the Score is calculated. I have read through the whole forum but did not find any example of a Score calculation. I have also read the source code, where I found the following explanation:

# For each word go through the buckets and calculate
# P(word|bucket) and then calculate P(word|bucket) ^ word count
# and multiply to the score

But I am still not able to do it manually or to explain exactly how the calculation is done. Could you please give me an example of a Score calculation? Thank you very much. It is very important to me.
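
One possible reading of the quoted source comment, as a minimal Python sketch rather than POPFile's actual code: raising P(word|bucket) to the word count and multiplying into the score is the same as adding count * log P(word|bucket) in log space. The function name, the toy words and the toy probabilities below are hypothetical, and the sketch leaves out anything the comment does not mention (for example bucket priors or unseen words).

```python
import math

def bucket_log_scores(word_counts, p_word_given_bucket):
    """Return {bucket: sum(count * log P(word|bucket))} for one message."""
    scores = {}
    for bucket, probs in p_word_given_bucket.items():
        total = 0.0
        for word, count in word_counts.items():
            # P(word|bucket) ^ count becomes count * log P(word|bucket)
            total += count * math.log(probs[word])
        scores[bucket] = total
    return scores

# Hypothetical toy numbers, not taken from any real POPFile corpus.
message = {"invoice": 2, "meeting": 1}
model = {
    "work": {"invoice": 0.02, "meeting": 0.05},
    "spam": {"invoice": 0.001, "meeting": 0.002},
}

scores = bucket_log_scores(message, model)
best = max(scores, key=scores.get)  # bucket with the highest log score
print(best, scores)
```

Working with sums of logarithms instead of a product of small probabilities avoids floating-point underflow on long messages.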

• Brian Smith - 2009-04-04

>>  I have read through whole forum, but I did not find any example ... <<

There is a separate "documentation" section here:
https://sourceforge.net/docman/index.php?group_id=63137

http://sourceforge.net/docman/display_doc.php?docid=13334&group_id=63137

The POPFile project has a new website at http://getpopfile.org

Brian

• Richard Cervenka - 2009-04-04

Following these documents, I have already tried to recalculate the score for one email. I chose one word and tried to find out how the frequency, probability and score for this word were calculated. I have 7 buckets. The occurrences of this word in the buckets are:

B1:68, B2:5, B3:16, B4:2, B5:9, B6:0, B7:1 (number of times my word W appears in each bucket)

Other specifications:
B1:33145, B2:49145, B3:33760, B4:32930, B5:56339, B6:34059, B7:23031 (total number of words in each bucket)
B1:8318, B2:12578, B3:8988, B4:8608, B5:14984, B6:9899, B7:6912 (number of unique words in each bucket)

Number of all unique words: 70287
Number of all words: 262409

The results from POPFile for my word are as follows:

             B1       B2       B3       B4       B5       B6    B7
Frequency    0,00205  0,0001   0,00047  0,00006  0,00016  none  0,00004
Probability  0,6733   0,0495   0,1584   0,0198   0,0891   none  0,0099
Score        3,7311   2,4265   3,0947   2,2024   2,6224   none  2,0567

I found out that the frequency is computed as the number of times word W appears in a bucket divided by the total number of words in that bucket, and that the probability is computed as the number of times word W appears in a bucket divided by the total number of times word W appears in all buckets. But I have no idea how the score is calculated. Could you please help me with this? It is very important to me. Thank you!
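
The two definitions in this post can be checked with a short Python sketch. It only restates the arithmetic described here using the counts quoted in the post; the variable names are made up for illustration and are not POPFile identifiers.

```python
# Counts quoted in the post: occurrences of word W and total words per bucket
word_count = {"B1": 68, "B2": 5, "B3": 16, "B4": 2, "B5": 9, "B6": 0, "B7": 1}
bucket_total = {"B1": 33145, "B2": 49145, "B3": 33760, "B4": 32930,
                "B5": 56339, "B6": 34059, "B7": 23031}

# Total number of times W appears across all buckets
occurrences_everywhere = sum(word_count.values())

# frequency: occurrences of W in the bucket / all words in that bucket
frequency = {b: word_count[b] / bucket_total[b] for b in word_count}

# probability: occurrences of W in the bucket / occurrences of W in all buckets
probability = {b: word_count[b] / occurrences_everywhere for b in word_count}

for b in word_count:
    print(f"{b}: frequency={frequency[b]:.5f} probability={probability[b]:.4f}")
```

Rounded to the same number of digits, the values for B1 come out as 0.00205 and 0.6733, matching the Frequency and Probability rows reported by POPFile.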

• naoki iimura - 2009-04-05

Hi

The score is calculated by the formula below:

score = (log(frequency) - not_likely) / log(10)

where not_likely = -log(number_of_all_words * 10).

score1 = (log(0.00205) + log(262409*10)) / log(10) = 3.7307

In other words, using the exact frequency 68/33145 instead of the rounded value 0.00205,

score1 = log(68/33145 * 262409*10) / log(10) = 3.7311

Naoki
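
As a check, the formula above can be applied to every bucket from the earlier post. This is a minimal Python sketch of that arithmetic, not POPFile's own code; the function name word_score is hypothetical, and returning None for a zero count is an assumption that simply mirrors the "none" shown for B6.

```python
import math

# Counts quoted earlier in the thread
word_count = {"B1": 68, "B2": 5, "B3": 16, "B4": 2, "B5": 9, "B6": 0, "B7": 1}
bucket_total = {"B1": 33145, "B2": 49145, "B3": 33760, "B4": 32930,
                "B5": 56339, "B6": 34059, "B7": 23031}

number_of_all_words = sum(bucket_total.values())  # 262409, as stated in the thread
not_likely = -math.log(number_of_all_words * 10)

def word_score(count, total):
    """score = (log(frequency) - not_likely) / log(10)"""
    if count == 0:
        return None  # assumption: mirrors the "none" POPFile reports for B6
    frequency = count / total
    return (math.log(frequency) - not_likely) / math.log(10)

scores = {b: word_score(word_count[b], bucket_total[b]) for b in word_count}

for b, s in scores.items():
    print(b, "none" if s is None else f"{s:.4f}")
```

With unrounded frequencies the results agree with the Score row reported by POPFile (3.7311, 2.4265, 3.0947, 2.2024, 2.6224, none, 2.0567).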

• Richard Cervenka - 2009-04-05

Thank you very much. That is helpful, but I’m not sure where the number 10 comes from or why it is used: first number_of_all_words is multiplied by 10, and then the result is divided by log(10).

• Wm - 2009-04-05

The whole point of homework is that people should do it themselves!

• naoki iimura - 2009-04-05

The 'not_likely' value is the logarithm of the frequency assumed for words that have not been seen in the bucket; POPFile uses 1/(10*number_of_all_words) as that frequency:

http://getpopfile.org/docs/faq:newwords

Dividing by log(10) converts the natural logarithm to the common (base-10) logarithm.
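
A small Python sketch of these two points, using the total from this thread. It assumes only the formula given earlier; the names unseen_frequency and score are made up for illustration. Read this way, the score is the base-10 logarithm of how many times larger the word's frequency is than the unseen-word frequency.

```python
import math

number_of_all_words = 262409  # total quoted in the thread

# Frequency assumed for a word not seen in a bucket, and its natural logarithm
unseen_frequency = 1 / (10 * number_of_all_words)
not_likely = math.log(unseen_frequency)  # same as -log(10 * number_of_all_words)

def score(frequency):
    # Difference of natural logs, divided by log(10) to change the base to 10
    return (math.log(frequency) - not_likely) / math.log(10)

f1 = 68 / 33145  # frequency of the word in bucket B1

print(score(f1))                           # about 3.7311
print(math.log10(f1 / unseen_frequency))   # same value via a single base-10 log
print(score(unseen_frequency))             # 0.0: the baseline itself scores zero
```

So the factor 10 inside not_likely sets the baseline frequency, while the division by log(10) is only a change of logarithm base.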

• Richard Cervenka - 2009-04-05

I am sorry, but I spent several hours trying to recalculate these numbers according to the documentation and Bayes' theorem, and it is not clear to me that the calculation is done as described by Naoki. I just want to know how the calculation is done and why.

Richard

• Brian Smith - 2009-04-05

>> it is not clear that the calculation is done as is described by Naoki <<

Naoki is one of the POPFile developers so I reckon he knows what he's talking about.

Are you looking at the _current_ code?

The current code is _not_ on SourceForge, it is at the new site (http://getpopfile.org).

POPFile 1.1.0 is the current release and you can browse the source for the POPFile engine at http://getpopfile.org/browser/tags/v1_1_0/engine

Brian

• 1. The first place to start reading is "How POPFile does email classification" which you can find here: http://sourceforge.net/docman/display_doc.php?docid=13334&group_id=63137

2. Then read "Bayes Theorem and Logarithms" which you can find here: https://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137

Those two documents give a general overview of the POPFile algorithm.

3. Take a look at the source code for Classifier::Bayes which you'll find here: http://getpopfile.org/browser/trunk/engine/Classifier/Bayes.pm

4. Perhaps read the DDJ article "Naive Bayes Text Classification": http://www.ddj.com/development-tools/184406064

5. The only part these documents do not describe is how POPFile handles the probability of a word that it has not previously seen.  This is described in the FAQ: http://getpopfile.org/docs/faq:newwords

John.