Project of the Month, May 2003

Background of leader(s):

Name: John Graham-Cumming
Age: 35
Occupation: VP of Engineering at Electric Cloud, Inc.
Education: BA/MA Mathematics and Computation and DPhil in Computer Security from Oxford University
Location: New York (USA)

Name: Stanley Krute
Age: 52
Occupation: Self-employed artist/writer/educator/computist
Education: BA Studio Art Stanford University, Teaching Credential Program – Art, Mathematics – UC Davis

Name: Sam Schinke
Age: 22
Occupation: Working student
Education: First year of computer-science study, with a heavy dose of philosophy courses on the side
Location: Vancouver, BC, Canada

How has SF.net helped you?
All of the infrastructure for an Open Source project was in place and free. Having a ready made bug database, CVS repository, mailing list system and forums saved a *lot* of work. Plus we have received excellent response on any of our issues with SF.net as a project; be they concerns, questions, and/or comments about the SF.net tools themselves. Another bonus is SF.net’s documentation. It has helped us to get developing with the various tools faster.

The number one benefit of SF.net for your project is:
John:
Exposure… being in the top 10 most active projects on SourceForge.net gets a lot of people to hear about POPFile.

Stanley:
I truly love SourceForge.net. I first heard of it a couple of years ago, and started working on my first SourceForge.net project soon thereafter. The thought of so many programmers working on so many Open Source projects puts a huge grin on my face. SourceForge.net is the alpha hive of the Open Source revolution. It’s the first place I look when the need arises for a piece of software to solve a new problem.

Sam:
Exposure, for sure. Many of our present contributors seem to have found POPFile while looking at other SF.net projects, and certainly a majority of our users.

Email is the “killer app” of the Internet, but keeping up with the sheer volume of incoming email arriving in one’s inbox, particularly with the avalanche of spam, is a daunting task. POPFile is a powerful learning application designed to make sorting incoming email an automatic task. Simply teach POPFile how you want your email sorted into folders and POPFile does the rest. With users reporting automatic sorting accuracy as high as 98%, POPFile is truly a universal timesaver. POPFile has only been on SourceForge.net for the past 6 months, but given the tremendous need for such a product on the net and its ability to work on all major computing platforms, it has enjoyed a top ten SF.net project ranking essentially since its inception.

Project Name: POPFile
Founded / Started:
August 2002

URL: http://popfile.sourceforge.net/
Project page: http://sourceforge.net/projects/popfile/

Description of project
POPFile is a proxy program that sits between a user’s email client and email server. It uses a statistical approach to sort a user’s email into an arbitrary number of categories.

Trove info:
POPFile works on Windows, Macintosh, and Unix/Linux systems. It is written in perl. It uses a browser interface, emitting validating HTML 4.01 and CSS.

How did you get started?
John:
While I was working at Scriptics (the Tcl company), I was receiving a large amount of email and went searching for an automatic sorting solution. The only one available at the time was iFile, which required that I use exmh. As a corporate user of Microsoft Outlook, I needed something that would play well in the Microsoft environment, so I ended up downloading the libbow toolkit from CMU and writing a COM+ plug-in for Outlook, which I called AutoFile.

Later I got to thinking about how long it took to download my POP3 email through a 56k line and came to the realization that message downloads ought to be sorted into ascending order of size, not delivery time, so that the small mails arrive first and you can be dealing with them while the big ones download. That led to the creation of POPtimize, a POP3 proxy written in perl and still available from www.extravalent.com.
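The size-first ordering John describes can be sketched in a few lines. This is only an illustration in Python, not POPtimize's actual perl code; the function name `download_order` and the sample sizes are assumptions. The input is a list of (message number, size in bytes) pairs, which is the information a POP3 LIST response provides.

```python
from typing import List, Tuple


def download_order(listing: List[Tuple[int, int]]) -> List[int]:
    """Return message numbers ordered by ascending size.

    `listing` holds (message_number, size_in_bytes) pairs, the same
    information a POP3 LIST response provides.
    """
    # Sort by size first; break ties by message number so that equally
    # sized messages keep their original delivery order.
    ordered = sorted(listing, key=lambda pair: (pair[1], pair[0]))
    return [number for number, _size in ordered]


if __name__ == "__main__":
    # Hypothetical mailbox: one large attachment and three small mails.
    listing = [(1, 2_400_000), (2, 1_800), (3, 52_000), (4, 1_800)]
    print(download_order(listing))
```

With this ordering the large message is fetched last, so the small ones can be read while it is still downloading.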

In August 2002, I realized after a newsgroup discussion that I could merge AutoFile and POPtimize to create a POP3 proxy that would work with any email client and do automatic email sorting. Hence, POPFile was born. In September 2002 I registered the SourceForge.net project for POPFile and released the code under the GPL.

Stanley:
A little over a year ago, in Jan of 2002, I started to work part-time on spam filtration. I’ve been using email for 26 years, and finally had had it: I was getting a couple of dozen spam a day. Even more importantly, friends, family, and clients were starting to get spam, and its content was particularly unsettling to them.

Like many folks, I started out thinking that a set of rule-based filters might work quite well. I began implementing such a set in Outlook Express. Things went well at first, and eventually I was able to catch about 97% of my fairly voluminous spam flow with a set of 800+ rules. But there were a number of vexing problems with the rules-based approach: filter set degradation, filter set maintenance, and the need for large ‘whitelists’ to avoid false positives.

In August of 2002, I saw Paul Graham’s paper on Bayesian filtration. It described a statistical approach to spam filtering. It sounded quite promising. And it appeared to solve the issues I’d come across with rules-based filters. But there were no implementations out there that I could get my hands on to check it out.

In September, I decided to take what I’d learned from working with rule-based filters in Outlook Express, mix it with some statistical elements, and build a small pilot application. It quickly became clear that an optimal path was to build a POP3 proxy to intercept the email stream. I’d never written such a proxy, so I “googled” around to see if there were some Open Source implementations. After hitting a number of dead ends, I found POPFile.

Wow! Not only was POPFile a POP3 proxy, and Open Source… but it was a Bayesian email filter!!! And it was on SourceForge.net, my favorite software hive. Talk about serendipity. I downloaded it, loved it, was happy to see that the source code was coherent, and quickly decided it would be much more fun to put my energies into the POPFile project than go and reinvent wheels on my own.

I started in working on POPFile’s HTML and CSS user interface code. Since it’s important to have POPFile as usable as possible in as many environments as possible, clean validating HTML and CSS code is a must. I’m quite happy that, at this point, POPFile’s UI validates to HTML Transitional 4.01 and CSS. Also, thanks to a huge chunk of work by David Smith, it meets an important accessibility standard, Bobby level.

Sam:
I was involved in the early beta-testing of John’s first versions and found myself learning perl and liking it. From finding bugs, and the specific piece of code that causes them, to fixing the bugs in a moderately-sized piece of open software is an easy step, even if you aren’t proficient in the specific language. I do suggest some programming experience, though. :)

What is the intended audience?
Ultimately, anyone who uses email and wants some help classifying it, whether they are a sophisticated user or not. POPFile is designed to be used by the average Windows user and the hard core Linux geek. Each release of POPFile gets easier to install and use, so that ‘anyone’ more and more truly means anyone.

What does POPFile do?
POPFile is an automagic email filter that learns. It is a POP3 proxy that automatically sorts email by adding a message header to downloaded messages indicating into which one of ‘n’ categories the message should be placed. The user tells POPFile which category, or bucket, particular email messages should go in. After a while, by analyzing the words contained in each categorized message, POPFile is able to start sorting messages without user input. By using a Naive Bayes text classifier and coupling it with an easy to use web based interface it’s possible to rapidly teach POPFile by example the different types of mail you receive. If the user provides consistent training, POPFile becomes rather smart about its sorting within a few hundred messages. Within a few thousand messages, it can get better at email sorting than many humans.
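To make the idea concrete, here is a minimal sketch of a Naive Bayes word classifier that trains on example messages per bucket and then stamps a header on a new message. It is written in Python for illustration and is not POPFile's perl implementation; the class and function names, the simple tokenizer, the Laplace smoothing, and the header name used here are all assumptions, and only plain single-part messages are handled.

```python
import math
import re
from collections import Counter, defaultdict
from email import message_from_string


class BucketClassifier:
    """Toy multinomial Naive Bayes: one word-count table per bucket."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> trained messages

    @staticmethod
    def tokenize(text):
        # Deliberately simple tokenizer: lowercase runs of letters/digits.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        """Record one example message as belonging to `bucket`."""
        self.word_counts[bucket].update(self.tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the most probable bucket, or None if nothing was trained."""
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        vocab_size = max(len(vocabulary), 1)
        total_messages = sum(self.message_counts.values())
        words = self.tokenize(text)

        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # log P(bucket) + sum of log P(word | bucket), with add-one
            # (Laplace) smoothing so unseen words do not zero the product.
            score = math.log(self.message_counts[bucket] / total_messages)
            total_words = sum(counts.values())
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


def tag_message(raw_message, classifier, header="X-Text-Classification"):
    """Add a header naming the chosen bucket (header name is an assumption)."""
    msg = message_from_string(raw_message)
    text = f"{msg.get('Subject', '')} {msg.get_payload()}"
    msg[header] = classifier.classify(text) or "unclassified"
    return msg.as_string()


if __name__ == "__main__":
    clf = BucketClassifier()
    clf.train("spam", "cheap pills buy now limited offer")
    clf.train("spam", "buy cheap watches now")
    clf.train("work", "meeting agenda for the project review")
    clf.train("work", "please review the attached project report")
    raw = "Subject: project meeting\n\nPlease review the agenda before the meeting.\n"
    print(tag_message(raw, clf))
```

An email client can then use an ordinary filter rule on that header to move the message into the matching folder, which is why only one simple set of client-side rules is needed.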

What makes POPFile unique?
POPFile is unique for two main reasons: it lets you automagically sort your email by teaching it like a child, and it is cross-platform. Show POPFile examples of your mail and it quickly learns what mail should go where. We want to make it easier and easier to set up and use. We want to make sure it works with as many types of email setups as possible. Unlike mail filters or solutions dedicated to spam removal, POPFile is flexible and accurate; it allows you to sort into an arbitrary number of categories.

How many people do you believe are using your software?
There have been 45,000 downloads of POPFile, but only about 10,000 downloads of the most recent version. It is hard to say how many are still using older versions, but no doubt some are. The total user-base currently is probably between 10,000 and 15,000 (March 2003).

What gave you an indication that your project was becoming successful?
John:
The first indicator was that the download count from SourceForge.net was rapidly increasing from 10s to 1000s. The second was when the CTO of a very large corporation emailed me to tell me how much he liked POPFile and I discovered that he’d heard about POPFile in a bar.

Stanley:
Seeing its position in the SourceForge.net ratings.

Sam:
My first indication was when I was no longer able to read all of the traffic in all of our forums. That, and the fact that we were getting bug-reports from people obviously unused to filing bug reports. If many mainstream software users are trying out beta Open Source software and liking it enough to report any trouble they have, I am pretty happy.

What has been your biggest surprise?
John:
A guy wrote to me from Europe to tell me that he’d modified POPFile to work with his cell phone so that he didn’t have to read spam on the tiny screen. He donated money because POPFile was directly saving him money: every spam downloaded onto a cell phone costs money because the subscriber pays by the byte. POPFile really helped him take a byte out of his spam. (Stop me before I pun again).

Stanley:
How well the program works. At this point, about a third of POPFile users are getting 98%+ accuracy. And about a sixth of users are getting 99%+ accuracy. That’s truly amazing. Most humans can’t manually sort their email with that level of accuracy. And it’s accomplished in the face of some remarkable efforts by spammers to mask the contents of their messages. John and Sam have done a remarkable job putting code into POPFile that unmasks message content. It’s a testament to the power of regular expressions. One could easily write a “Practical RegExxing” book based on the POPFile source code.

Sam:
The irresistible urge people have to try to use POPFile’s Bayesian classifier for machine-trivial classifications. If messages belonging to a bucket ALWAYS have a single predictable property, a simple filter will always work much better. It is for this reason alone that I am not opposed to the inclusion of magnets (simple pattern-based rules) in POPFile. It lets people leave only one simple set of rules in their email client. Bayesian classification is much better suited for tracking decisions that really do need subjective human input. Is this email interesting to me? Does this email pertain to a certain interest of mine? Is this email urgent? Is this email the same topic as those other emails?
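The division of labour Sam describes, where a fixed rule decides when a message has a single predictable property and the learned classifier decides otherwise, can be sketched as follows. This is a Python illustration only; the magnet representation as (header name, substring, bucket) tuples and the function names are assumptions, not POPFile's actual data structures.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical magnet shape: (header name, substring to look for, bucket).
Magnet = Tuple[str, str, str]


def match_magnet(headers: Dict[str, str], magnets: List[Magnet]) -> Optional[str]:
    """Return the bucket of the first magnet that matches, else None."""
    for header_name, needle, bucket in magnets:
        value = headers.get(header_name, "")
        if needle.lower() in value.lower():
            return bucket
    return None


def sort_message(
    headers: Dict[str, str],
    body: str,
    magnets: List[Magnet],
    classify: Callable[[str], str],
) -> str:
    """Magnets decide first; the statistical classifier is the fallback."""
    bucket = match_magnet(headers, magnets)
    if bucket is not None:
        return bucket
    return classify(body)


if __name__ == "__main__":
    magnets = [("From", "newsletter@example.org", "newsletters")]
    # Stand-in for a trained classifier; always answers "personal" here.
    fallback = lambda body: "personal"
    print(sort_message({"From": "Newsletter@Example.org"}, "...", magnets, fallback))
    print(sort_message({"From": "friend@example.com"}, "lunch?", magnets, fallback))
```

Checking the rules first keeps trivially predictable mail out of the statistical path, leaving the classifier for the subjective decisions.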

What has been your biggest challenge?
John:
Making a perl script usable by the average Windows user. It was not possible to tell the average person used to a SETUP.EXE to download ActivePerl and then install the scripts and type perl popfile.pl. That’s too much for 90% of computer users; part of POPFile’s world domination is predicated on the need to infect the Windows desktop with simple email classification. All I can say is thank heavens for the NSIS SuperPIMP installer.

Stanley:
Finding time to work on the project. I’ve got a fairly time-crunched existence, to put it mildly.

Sam:
Right now I am working on a suite to provide an objective test of POPFile’s accuracy, so that different classification and mail parsing strategies can be objectively tested.

What are you most proud of?
John:
An ISP wrote to me to tell me that his company had decided not to use a $30,000 piece of spam removal software but was going to donate $300 per month to the POPFile project for their use of POPFile. That was very cool. Equally great is being able to take an Open Source project out of the hands of geeks and put it into the hands of the average Outlook Express user. With all the talk about Linux’s success it’s easy to forget that Open Source software has had little impact on the general desktop user. POPFile is a true “cross over” project, being both GPL Open Source and easy to use.

Stanley:
That such a talented group of developers, testers, and hardcore users is working so hard to create and give away a polished tool that gives folks back their inboxes.

Sam:
The speed with which POPFile’s development has progressed. I think we are all pleasantly surprised and proud of everyone’s contributions to date.

Why do you think your project has been so well received?
It works. Email is the killer app on the Internet; every single email user receives a bunch of email in their inbox with absolutely no intelligent sorting. It serves a pressing need. It’s relatively easy to install and operate and it’s cross-platform. The other main draw has been friendly anti-spam technology that doesn’t operate using some “mystery” input (e.g., signature updates from a central server, distributed checksums, regexp recipes) but rather learns to be extremely attentive to the individual user’s decisions.

Where do you see your project going?
John:
To infinity and beyond! :)

The road ahead consists of:

  1. Support for users with need for special accessibility. The plan is to ship POPFile with AA Bobby rating.
  2. More and more languages. I would especially like to see Asian and Middle Eastern languages covered (any volunteers to do Japanese, Chinese, Tagalog, Hebrew and Arabic?)
  3. A total re-architecture of the POPFile core to make use of more of POPFile’s OO and really set up the boundaries between the objects correctly and also make use of inheritance to simplify writing a new POPFile Loadable Module.
  4. Support for IMAP and SMTP protocols.
  5. Automatic configuration of Outlook, Outlook Express, Eudora, Mozilla, Pegasus… on Windows.
  6. A SOAP/XML-RPC/Web Service interface so that applications can use POPFile’s core functionality.
  7. Client plug-ins that leverage (6) so that a user of Outlook could click Outlook menus to access POPFile without having to go to the web interface.

Stanley:
POPFile at its core is a powerful pattern recognition learning engine. It will be interesting to see how we can harness that engine to all sorts of data-handling chores. I’d like to have POPFile filter RSS news items and NNTP posts, for example.

Sam:
In addition to John’s thoughts, I see integration with MTA/MXs at the ISP level allowing ISPs to replace relatively high-maintenance spam solutions with user-maintainable POPFile accounts. Getting there will require significant work as yet, but the work is mostly work that we are planning anyways, and the end result is attainable. I am with John in believing that integration and acceptance at the client/desktop level will be a huge boost for POPFile, but I think acceptance by ISPs would be even bigger, and have an even larger impact on junk mail.

We already have some corporate network admins performing experiments with centrally managed POPFile on their MTAs, and their bosses have been very impressed. But I don’t think centrally controlled filtering, much less any filtering that requires manual review of email content, would be appropriate at an ISP level.

How can others contribute?
John:
POPFile has a POPFile Developers Guide available in the Docs section of the POPFile project on SourceForge.net. Anyone interested in contributing needs to read that document first and then drop in on the Bleeding Edge – Source Code forum where the developers meet. Make a suggestion there and then get coding.

Stanley:
I have a strong belief that people do their best work when they’re working on projects that they love. Folks who try out POPFile, and love it, will come across things they think can be improved. At that point, they should make some suggestions in the Bleeding Edge forums where we coordinate work on the project. Chances are good someone will say “Go for it”. We can use help on all aspects of the project: coding, documentation, testing, and user support.

Sam:
We have loosely defined plans for the long-term future of POPFile that would go smoother with the help of folks who are particularly knowledgeable in certain areas. Folks knowledgeable in other areas are always welcome, too! To fully handle extended character-set mail it will be necessary to use UTF-8/Unicode for our message handling. Help or advice in this area would be welcome. I think we could also use somebody quite experienced in multi-user Linux/BSD environments if ISP use is to become a reality.

Further, since the heart of POPFile is a statistical engine, any statisticians would be welcome. There is currently some examination of alternative or supplementary classification strategies, and more informed input would be great. We have some tools to give “results based” answers to whether one approach is better than another, but we can’t always tell _why_. To “sign up” to help in any way, visit POPFile’s Bleeding-edge forums on SourceForge.net. You can lurk, post ideas, issue patches or contribute however you feel comfortable.

Do you work on POPFile full-time, or do you have another job?
John:
POPFile is a part time project for me. I’d love it to be my day job, but it doesn’t pay the bills…yet.

Stanley:
It is part time for me. I run a small computer hardware/software/training web development company, and am also quite active in animal rescue — I’ve currently got 18 dogs and half a dozen cats in residence; it’s a bit like being a dairy farmer in terms of time commitment.

Sam:
My work on POPFile is part-time.

If you work on the Open Source code part-time, how much time would you say you spend, per week, on the project?
John:
I spend about 1 hour a day answering email during the week (and thank goodness for POPFile since I need to sort through about 400 mails a day) and then I spend *a lot* of time at the weekend working on the code. I probably spend 20 hours a week total coding, answering emails and re-architecting in my head.

Stanley:
I probably average 20 hours/week doing POPFile work right now. But that varies, depending on how screaming my other tasks are.

Sam:
10-20 hours. Some weeks I don’t have time for much work at all though, and other weeks I go well beyond that.

How do you coordinate the project? Make assignments? Assign bugs? Perform regression testing?
From my secret underground lair I communicate almost exclusively with the team of developers and patchers through the SourceForge.net forums. POPFile has three special developer oriented forums: Bleeding Edge – Source Code, Bleeding Edge – documentation and Bleeding Edge – UI.

I generally assign specific tasks through those forums and also often respond to people who wish to fix a particular bug. Many users submit patches through the SourceForge.net patch system which I also look at.

POPFile has a regression test suite written in perl that can be accessed by running ‘make test’ in the POPFile engine directory. We do not yet have 100% code coverage in the test suite and shortly I will stop accepting patches that do not have an associated test suite.

What is your development environment like?
John:
I do everything on a Dell Inspiron 8000 (750 MHz) running Windows XP with ActivePerl. On the same machine I test using cygwin and the cygwin distribution of perl, and I run VMWare with two configurations: Windows 98 (for testing POPFile installer compatibility with the DOS versions of Windows) and Linux (for testing POPFile on a *nix system).

Stanley:
My computers are all Stanley Screamers, designed and built by my little company. The one I do most of my POPFile work on has got a 2 GHz AMD Athlon, 768 MB RAM, and four RAIDed hard drives. I use UltraEdit32 for editing my source code. I use ptkdb for perl debugging.

Sam:
I’m on a Pentium III – 600, 384 MB RAM, Windows 98, with a 19″ monitor. ActiveState and Cygwin perl. PTKDB and perl’s native debugger suit my debugging needs. WinCVS for CVS, Putty for SSH, and any syntax-highlighting and subroutine tagging text editor as an IDE.

If you could change one thing about the project, what would it be?
Add a bio-feedback based UI (web cam, microphone) which would classify messages based on unconscious facial expressions or conscious verbalizations… :) Short of that, it would be to make the installer far easier to use and integrate directly with Outlook Express so that the average user can get up and running with POPFile without knowing anything about configuration.

What’s on your project wish list?
John:

  1. POPFile on every desktop.
  2. 100% of email runs through POPFile.
  3. Ally Sheedy confirms that she’s a POPFile user and thinks that I rock!

(I’ll settle for 1 and 2)

Sam:
I wish for 99%+ accuracy for 95% of users after the first 100 email messages per bucket. That’s less than 1 error per 100 messages after a 200 message training period for a user who only uses POPFile to separate spam from non-spam. And 99.9% accuracy after the 1000 messages per bucket mark. Believe it.

Milestones:
Look to the release of future versions and the branching of development to bring a production-quality version of POPFile, while more radical development continues to bring about a new and much better beta.