A certain type of spam is consistently getting through my popfile filter. It gets classified as "tennis" of all things, even though I keep reclassifying these messages as spam. So I clicked on a few of them to see which words are confusing popfile.
Based on the word coloring in the HTML view, I can see that lots of words in the headers were used to classify the message, but no words in the actual text of the message are colored, so I believe they did not contribute to the choice of bucket. Maybe the spammers constructed the message in a funny way just to confuse our filters. I'm pasting an example message below. Since the colors matter here, I'll also post the HTML saved from the popfile web interface and link to it from this bug report.
The text that I expected to be scanned, which would surely have sent this class of messages to the spam bucket, is in the text/plain part marked "Content-Transfer-Encoding: quoted-printable". It starts with "Discover a wide range..." and ends with "Enter our watches shop!" None of those words are colored at all, so I believe popfile's message parser never looked at them. That leaves it unable to classify this category of message well.
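To make the expectation concrete, here is a rough sketch of the body text I would expect a parser to hand to the classifier. It uses Python's standard `email` module purely for illustration; it is not popfile's parser (which is written in Perl), and I am not claiming popfile works this way. The `RAW` message is a shortened, made-up stand-in for the example in this report, and `body_parts` is a name invented for this sketch.

```python
from email import message_from_string

# Hypothetical, shortened stand-in for the spam message in this report.
# The "=2E" is a quoted-printable escape for "." added for illustration.
RAW = """\
From: <sender@example.com>
Subject: My watch arrived today
MIME-Version: 1.0
Content-Type: multipart/alternative;
\tboundary="XYZ"

This is a multi-part message in MIME format.

--XYZ
Content-Type: text/plain;
\tcharset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Discover a wide range of top quality replica watches
offered at discount prices=2E
Enter our watches shop!

--XYZ
Content-Type: text/html;
\tcharset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<HTML><BODY><P>Enter our watches shop!</P></BODY></HTML>

--XYZ--
"""


def body_parts(raw):
    """Return (content_type, decoded_text) for every non-container MIME part."""
    msg = message_from_string(raw)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart/alternative container itself
        # decode=True undoes the Content-Transfer-Encoding (quoted-printable here)
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "iso-8859-1"
        parts.append((part.get_content_type(),
                      payload.decode(charset, errors="replace")))
    return parts


if __name__ == "__main__":
    for ctype, text in body_parts(RAW):
        print(ctype, "->", text.split()[:5])
```

If both parts come back with their words decoded like this, those are the words I would expect to see colored in the popfile message view.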
Overall, popfile has saved my wife and me endless trouble by categorizing our spam messages with very few false positives. Because my email address is listed in several places on the net, such as in RPM files I released, about 90% of the mail I get is spam, and reading email without popfile would be intolerable. Thanks!
Example message:
Return-Path: <earl@surecom.com>
Received: from bd04bc77.sts.virtua.com.br (bd04bc77.virtua.com.br [189.4.188.119] (may be forged))
by my.host.com (8.12.8/8.12.8) with ESMTP id lA7LwpOu006497
for <--my-address-->; Wed, 7 Nov 2007 16:58:52 -0500
Received: from [189.4.188.119] by ns2.xpedite.com; Wed, 07 Nov 2007 21:58:33 +0000
Message-ID: <000901c82189$016842e9$920924b9@olnft>
From: <earl@surecom.com>
To: <--my-address-->
Subject: My watch arrived today
Date: Wed, 07 Nov 2007 20:11:11 +0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_NextPart_000_0006_01C82189.0167599C"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.3790.2663
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.2757
Status:
This is a multi-part message in MIME format.
------=_NextPart_000_0006_01C82189.0167599C
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Discover a wide range of top quality replica watches
offered at discount prices. Among the timepieces on
sale make sure to choose a replica watch of a famous
brand featuring exactly the functions and style you need.
Hurry up or we will run out of watches at stock of our on-line shop.
Enter our watches shop!
------=_NextPart_000_0006_01C82189.0167599C
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.3790.2759" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY>
<P><FONT face="Arial">Discover a wide range of top quality replica watches <BR />offered at discount prices. Among the timepieces on <BR />sale make sure to choose a replica watch of a famous <BR />brand featuring exactly the functions and style you need. <BR />Hurry up or we will run out of watches at stock of our on-line shop.</FONT></P> <P><A href="http://althynezzone.com/"><FONT face="Arial"><STRONG>Enter our watches shop!</STRONG></FONT></A></P></BODY></HTML>
------=_NextPart_000_0006_01C82189.0167599C--
saved html from popfile interface
File Added: popfile-bug1.html