Alan Beale - 2003-10-17

A month ago, I started a thread in which I proposed that POPFile should monitor its own accuracy, and use that knowledge to rate the "confidence" it has in its classifications.  This is another approach to the problem that the chi-square feature attempted to solve, that is, to note classifications which might be more dubious (likely to be incorrect) than others.  The patch makes this determination, not on the basis of the Bayesian algorithm, but on POPFile's own past performance.

I have now submitted patch 825662 to address this. 

http://sourceforge.net/tracker/index.php?func=detail&aid=825662&group_id=63137&atid=502958

This patch is a prototype only.  I do not expect it to be accepted, nor do I recommend that it be.  I'm putting it out there so that interested parties can try it out, indicate whether it seems to have any value, make suggestions for improvement, rewrite the code, etc.  Even though I came up with the idea and implemented the patch, I am not sure whether this is really an improvement or not, though there are aspects of it I really like.

An overview of the operation of this patch is as follows:

1.  POPFile is modified to keep lifetime statistics of messages
classified (other than by the use of magnets), false negatives and
false positives.  These statistics are maintained independently of the
Buckets page statistics.  The statistics are updated when a message is
reclassified or deleted.  If, on startup, POPFile finds there are no
lifetime statistics, it copies the Buckets page statistics.

2.  Using the record of its accuracy maintained in the statistics,
POPFile assigns a confidence level to each non-magnet classification.
The confidence level is classed as HIGH, MED or LOW.  The confidence
level is shown together with the bucket on the History page.

3.  The confidence level is displayed in the Message View page.  If
the classification or confidence level of a message has changed since
its original classification, this fact is noted.

4.  The bucket statistics display is augmented to include the numeric
confidence factor associated with each bucket.

5.  The new code to "unclassify" a message if the probabilities of the
two most likely buckets are close has been disabled, not because it
isn't a good idea, but because it is interesting to see how the new
confidence code handles messages with this degree of uncertainty.
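
To make points 1 and 2 concrete, here is a minimal sketch in Python (not the patch's actual Perl code).  The names BucketStats, prime_lifetime and confidence_level, the definition of accuracy as uncorrected classifications divided by total classifications, and the 0.90/0.60 cut-offs are all assumptions made for illustration; none of them is taken from the patch.

```python
from dataclasses import dataclass


@dataclass
class BucketStats:
    """Lifetime counters for one bucket (magnet classifications excluded)."""
    classified: int = 0       # messages POPFile put in this bucket
    false_positives: int = 0  # of those, later reclassified to another bucket
    false_negatives: int = 0  # messages later reclassified INTO this bucket

    def accuracy(self) -> float:
        """Fraction of classifications into this bucket that stood uncorrected."""
        if self.classified == 0:
            return 0.0  # assumption: no record yet means no confidence
        return (self.classified - self.false_positives) / self.classified


def prime_lifetime(lifetime: dict, buckets_page: dict) -> dict:
    """On startup, copy the Buckets page counters if no lifetime record exists."""
    if not lifetime:
        return {name: BucketStats(s.classified, s.false_positives, s.false_negatives)
                for name, s in buckets_page.items()}
    return lifetime


def confidence_level(factor: float, high: float = 0.90, med: float = 0.60) -> str:
    """Map a numeric confidence factor to HIGH, MED or LOW (hypothetical cut-offs)."""
    if factor >= high:
        return "HIGH"
    if factor >= med:
        return "MED"
    return "LOW"


# Example: a bucket with 200 classifications, 10 of them later corrected.
page = {"spam": BucketStats(classified=200, false_positives=10)}
lifetime = prime_lifetime({}, page)
level = confidence_level(lifetime["spam"].accuracy())  # 190/200 = 0.95 -> "HIGH"
```

Because the copy happens only when the lifetime record is empty, later resets of the Buckets page statistics would not touch the lifetime counters in this sketch.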

Here are some issues in the current implementation.  These are areas
where the implementation could possibly be improved.

1.  Regardless of the statistics, POPFile never allows a bucket's
confidence factor to become either 1 or 0.  The exact way this is done
is somewhat arbitrary.

2.  Priming the lifetime statistics from the existing statistics
distorts the results when a bucket has associated magnets.  An
alternate method, starting the statistics at zero, handles this case
more correctly, but has the disadvantage of making even a very
well-trained POPFile initially uncertain.

3.  Possibly, rather than using lifetime statistics, a moving window
of statistics should be used, so that POPFile's confidence is not held
back by early mistakes.

4.  Because the lifetime statistics are not recomputed except when a
message is reclassified or deleted, they may be significantly out of
date for users who have a long interval for expiring messages.  One
could consider having a separate time interval for accepting the
results of classifications which have not been changed, or having a
"Confirm All" button to tell POPFile to update the statistics.
There is even a case for having an "Accept Classification" button in
the message view, to allow one to scroll through the messages with
Next, accepting or reclassifying each one.

5.  The display of the confidence level in the History page is not
very attractive.  I tried displaying the numeric confidence level
instead, and that was worse.  One could add another column for this
information, but I felt the cost in screen real estate was too high.

6.  The Message View page only notes when the HIGH/MED/LOW level of
the confidence level has changed, not when the numeric value has
changed. I feel this is better, but others may feel differently.

7.  The confidence factor shown on the Buckets page is "out of synch"
with the other statistics.  An alternative is to compute a confidence
factor based on the Buckets page statistics.  I think this is a bad
idea - the values shown should relate to the values used in recent
classifications.  Also, only a few mathematically inclined users are
likely to notice the discrepancy.

8.  One could allow the confidence level to affect the classification
directly.  For instance, imagine the following hypothetical
breakdown:

Bucket      Probability (%)    Confidence

spam             70.0            0.210
personal         30.0            0.240

In this case, POPFile's record of being right when it classifies a
message as personal is 80 %, while for spam it is only 30 %, which is
consistent with the confidence values shown (0.30 x 0.80 = 0.240 and
0.70 x 0.30 = 0.210).  One could choose to make POPFile classify this
message as personal, but I did not do so.  It makes the implementation
harder,
and it also makes the confidence level partially reflect its own
accuracy, rather than merely that of the Bayesian algorithm.  As the
situation described will come up relatively seldom, I don't think any
such change is worth the trouble, even if one believes it would be an
improvement (which I do not).
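
The arithmetic behind issues 1 and 8 can be sketched as follows, again in Python rather than the patch's own code.  This assumes that a message's confidence is the Bayesian probability multiplied by the bucket's accuracy record, which reproduces the figures in the breakdown above but is not confirmed as the patch's formula.  The 0.01 and 0.99 bounds used to keep the factor away from 0 and 1 are hypothetical, as are the function names.

```python
def clamp_factor(accuracy: float, floor: float = 0.01, ceiling: float = 0.99) -> float:
    """Keep a bucket's confidence factor strictly between 0 and 1.

    The bounds are arbitrary placeholders, not the values used by the patch.
    """
    return min(max(accuracy, floor), ceiling)


def message_confidence(probability: float, bucket_accuracy: float) -> float:
    """Assumed rule: Bayesian probability times the bucket's clamped accuracy."""
    return probability * clamp_factor(bucket_accuracy)


# (probability, accuracy record) for the two buckets in the breakdown.
candidates = {"spam": (0.70, 0.30), "personal": (0.30, 0.80)}

scores = {bucket: round(message_confidence(p, acc), 3)
          for bucket, (p, acc) in candidates.items()}

by_probability = max(candidates, key=lambda b: candidates[b][0])  # what the patch does
by_confidence = max(scores, key=scores.get)                       # the rejected alternative
```

Here by_probability is spam and by_confidence is personal, which is exactly the disagreement described in issue 8.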

If one believes this feature is an improvement to POPFile, here are
some possibilities for ways in which the feature could be
extended/made more useful.

1.  The confidence level could be recorded in a new header
(X-POPFile-Confidence?).  This was one of the motivations for reducing
the confidence to LOW/MED/HIGH, which is simpler for most email
filtering systems to handle than an arbitrary numeric value.

2.  Provision could be made for sorting the history by confidence. 
(In practice, this is likely to cause all mail for the same bucket to
be sorted together.)

3.  It would be nice to allow users to define the boundaries between
the confidence levels.

4.  The GUI could certainly be improved to present the confidence
level in a better way.  I'm not at all happy with the new appearance
of the History page, but I'm not a user-interface kind of guy, and am
out of ideas for how to improve it.
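
As an illustration of extension 1, the following Python sketch stamps a message with the proposed header.  The header name X-POPFile-Confidence is only the tentative name suggested above, tag_confidence is a hypothetical helper, and POPFile itself would do this in Perl as the message passes through the proxy.

```python
import email

LEVELS = ("HIGH", "MED", "LOW")


def tag_confidence(raw_message: str, level: str) -> str:
    """Return the message text with an X-POPFile-Confidence header set."""
    if level not in LEVELS:
        raise ValueError(f"unknown confidence level: {level!r}")
    msg = email.message_from_string(raw_message)
    del msg["X-POPFile-Confidence"]      # drop any earlier value; no error if absent
    msg["X-POPFile-Confidence"] = level
    return msg.as_string()


raw = "From: sender@example.com\nSubject: hello\n\nMessage body\n"
tagged = tag_confidence(raw, "LOW")

# A mail client rule then only needs a simple string comparison.
needs_review = email.message_from_string(tagged)["X-POPFile-Confidence"] == "LOW"
```

A three-valued header like this can be matched by the simple "header contains" rules most mail clients offer, which is the reason given above for preferring LOW/MED/HIGH to a numeric value.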

If you want to try out the patch, you should clear your history and remove the history cache before bringing it up for the first time.  To simulate the effect of the patch on a new POPFile user, reset your statistics before bringing the patched version up for the first time.  (Of course, for a real emulation of the new user experience, you should also discard your corpus before bringing up the patch.)

Please let me and John know if you see any value in this, or if you have ideas about how to improve it.