Junya Ishihara - 2004-01-30

POPFile is mentioned in the February 2004 issue of Nikkei Byte, a Japanese magazine about computers and technology.

Only the table of contents is available on the web:
http://store.nikkeibp.co.jp/mokuji/nby249.html

POPFile appears as one of the spam filters in an article titled "How much can spam be filtered?". The article compares the spam filtering ability of seven programs: Eudora 6J, Norton AntiSpam 2004, Virus Buster 2004 (I think it is the Japanese version of Trend Micro PC-cillin Internet Security), McAfee SpamKiller, Outlook 2003, Mozilla 1.5, and POPFile 0.20.1. POPFile got the highest score after training on 30 spam mails and 30 normal mails. On Japanese emails, POPFile, Norton AntiSpam, and Virus Buster achieved decent accuracy, while the other programs did not work at all.

It is a very interesting article, because I have never seen another one that examined spam in Japanese so thoroughly. The article points out three issues that all current spam filtering software has and should solve in the future if possible.

1) Shorten the training period, 2) Handle Japanese emails more correctly, 3) Semantic analysis.

I personally think that POPFile can solve issue 1) by providing an easier way to use insert.pl. Currently it is a tool for advanced users, but if it becomes easier to use, for example through a UI, most users will see higher accuracy as soon as they start using POPFile.
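
As a rough illustration of what a friendlier front end to insert.pl might do, the sketch below builds one insert.pl command per bucket from a folder of saved messages. It assumes insert.pl is invoked as "insert.pl <bucket> <messages>" and that the mail is stored as one sub-directory per bucket; the function name build_insert_commands and the directory layout are hypothetical and not part of POPFile.

```python
from pathlib import Path
from typing import Dict, List


def build_insert_commands(corpus_root: Path,
                          insert_script: str = "insert.pl") -> Dict[str, List[str]]:
    """Return one argument list per bucket directory found under corpus_root.

    Assumed layout (hypothetical): corpus_root/<bucket>/<message files>.
    Assumed invocation: perl insert.pl <bucket> <messages>.
    """
    commands: Dict[str, List[str]] = {}
    bucket_dirs = sorted(p for p in corpus_root.iterdir() if p.is_dir())
    for bucket_dir in bucket_dirs:
        messages = sorted(str(p) for p in bucket_dir.iterdir() if p.is_file())
        if messages:  # skip buckets that have no saved messages
            commands[bucket_dir.name] = ["perl", insert_script,
                                         bucket_dir.name, *messages]
    return commands
```

Each returned list could then be passed to subprocess.run from the directory that contains insert.pl. A UI would only need to let the user pick the folders; the command lines stay the same.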

2) is rather unfair, because only 20 Japanese emails (10 normal, 10 spam) were used for training. With more training emails, I think POPFile would get much higher accuracy even on Japanese emails, thanks to Kakasi, a Japanese language processing filter, and Bayesian classification. However, I agree that there is much to do to improve Japanese handling. Making insert.pl support Japanese emails is one thing. Supporting encodings other than ISO-2022-JP (UTF-8, EUC-JP, Shift_JIS) is another.
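
To make the encoding point concrete, here is a minimal sketch of decoding a message body that may arrive in any of the four encodings named above. It is written in Python for illustration only and is not how POPFile does it; the function name decode_japanese and the fallback order are assumptions of this sketch.

```python
from typing import Optional, Tuple

# Encodings tried, in this order, when no usable charset is declared.
# The order is an assumption of this sketch.
FALLBACK_ENCODINGS = ("utf-8", "euc_jp", "shift_jis")


def decode_japanese(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Decode raw bytes and return (text, name of the encoding that worked)."""
    candidates = []
    if declared:
        candidates.append(declared)       # trust the declared charset first
    if b"\x1b" in raw:
        candidates.append("iso2022_jp")   # ISO-2022-JP switches sets with ESC
    candidates.extend(FALLBACK_ENCODINGS)

    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue                      # wrong guess or unknown charset name

    # Last resort: keep what can be read and mark the rest as replacements.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

Once every message is decoded to the same internal form, the word splitting and classification steps no longer need to care which encoding the sender used. Trial decoding like this can still guess wrong on short or unusual input, so a declared charset should be preferred when one is present.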

3) is challenging. For the article, two test emails were created artificially: one is an outright spam mail, and the other is a warning message from MIS quoting that spam. All the software classified both emails as spam. The article says semantic analysis could be applied to distinguish these two similar emails.