Notes:
Changes (rc 1d):
(changed) Rather than modifying the subject line of incoming emails, I opted to insert a SpamBayes-style (spambayes.sourceforge.net) X-Hammie-Disposition header with a value of 'yes SUSPECTEDSPAMxxxx' or 'no '. 'yes' marks a suspected spam with a probability of xxxx/10000; 'no' means the email is not suspected spam.
(other) Speaking of SpamBayes, I've joined their mailing list and started talking a bit about different issues. Quite a few people there are amazingly good with statistics and with making the statistics work, much of which I don't understand. If you are looking for something that works better theoretically, head over to SpamBayes. For simplicity, though, it's hard to beat PASP.
(added) I added a utility, mbox2pasp.py, for converting Unix mbox format files (messages separated by 'From - ' lines) to PASP format files (messages separated by '\n.\n'). I also created the converse, pasp2mbox.py.
(fixed) The LIST command. PASP now adjusts the listed mail sizes to include the added X-Hammie-Disposition header. It's only 44 bytes/email, but as Richie Hindle of the SpamBayes mailing list says, 'there may be an email client out there that malloc()s just enough space'. I didn't adjust the STAT command because different servers respond differently: some use '+OK 5 messages (2352 octets)', others use '+OK 5 messages 2352 octets'. I know because I've got two POP accounts, and each one uses a different form.
(cleanup) I cleaned up the code for inserting the header. It's now nice and tight.
(update/fixed) All of the filtering options can be turned off (if you just want a POP3 proxy without spam categorization) by changing the 'FILTER' variable in the pasp.py source to 0. This has been the case for a while, but a few things that should have been repaired earlier have now been repaired.
(potential future option) The current code waits for the next complete block of information before forwarding it. That block could be the response from a server, a command from a client, the entirety of an email, etc. While this makes sense when filtering is going on (it pays to keep track of what is happening in chunks), it makes little sense for non-filtering use, except when receiving commands from the client (they are always single-line and less than 100 bytes long). It's really bad when the proxy is downloading a GIANT email, your mail client times out and disconnects, and the proxy is left holding the bag (I am pretty sure PASP handles this well). Again, Richie Hindle suggested an internal timeout after which the proxy starts forwarding incomplete replies. This makes sense, but requires a bit of work. I'm contemplating it.
(changed) I changed the testing code so that it no longer requires separate testing and database corpora. I use an idea from SpamBayes: check each email in the spam/ham corpora against the index, remembering to remove that email from the index before testing it. They do this on subsets; I do it on everything, which takes about 25 minutes with all three categorization techniques and 6,000 emails on a Celeron 400. On my spam and nonspam emails, I'm getting about 2% false negatives and zero false positives. This could change as my spam and nonspam corpora increase in size.
(added) For those of you with wxPython installed, you should get a dialog telling you when pasptest.py has finished executing.
(added) The testing module enumerates emails as it checks them. This gives you somewhat of a status indicator. Remember that it's using 3 algorithms to decide whether each email is spam or not.
(changed) I changed the format of many of the commands, mostly to make them a bit more portable and to potentially reduce processing requirements (mostly during the pasptest.py testing).
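To make the X-Hammie-Disposition change above concrete, here is a minimal sketch of how such a header value could be built and attached. It is not the code in pasp.py: the function names, the 0.5 threshold, and the clamp to 9999 (so the xxxx field stays four digits) are all assumptions made for illustration.

```python
def disposition_value(spam_probability, threshold=0.5):
    """Build the header value for a spam probability in [0, 1].

    The threshold and the clamp to 9999 are illustrative assumptions,
    not values taken from pasp.py.
    """
    if spam_probability >= threshold:
        # Express the probability as xxxx/10000, zero-padded to 4 digits.
        scaled = min(int(round(spam_probability * 10000)), 9999)
        return "yes SUSPECTEDSPAM%04d" % scaled
    # Not suspected spam: the notes give this value as 'no ' (with the space).
    return "no "


def add_disposition(raw_message, spam_probability):
    """Prepend the header line to a raw message (headers, blank line, body)."""
    header = "X-Hammie-Disposition: " + disposition_value(spam_probability)
    return header + "\r\n" + raw_message


if __name__ == "__main__":
    message = "Subject: hello\r\n\r\nJust checking in.\r\n"
    print(add_disposition(message, 0.9731))
```

A mail client can then filter on the header instead of on a rewritten subject line, which leaves the original subject untouched.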
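The mbox-to-PASP conversion mentioned above can be sketched roughly as follows. This is not the actual mbox2pasp.py; the function name is hypothetical, and dropping the 'From - ' envelope line from the output is an assumption. The only details taken from the notes are the two separators.

```python
def mbox_to_pasp(mbox_text):
    """Convert mbox-style text to messages joined by '\\n.\\n'.

    Assumption: each message starts at a line beginning with 'From - ',
    and that envelope line is dropped from the output.
    """
    messages = []
    current = None  # lines of the message being collected, if any
    for line in mbox_text.splitlines():
        if line.startswith("From - "):
            # A new message begins; store the previous one first.
            if current is not None:
                messages.append("\n".join(current))
            current = []
        elif current is not None:
            current.append(line)
        # Lines before the first 'From - ' line are ignored.
    if current is not None:
        messages.append("\n".join(current))
    return "\n.\n".join(messages)


if __name__ == "__main__":
    sample = (
        "From - Mon\nSubject: a\n\nhello\n"
        "From - Tue\nSubject: b\n\nworld\n"
    )
    print(mbox_to_pasp(sample))
```

Note that this simple split treats any body line starting with 'From - ' as a message boundary, and a body containing a line with a lone '.' would be ambiguous in the output; a real converter would need to decide how to handle both cases.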