tess-developers Mailing List for The Spam Secretary
Brought to you by:
kwerle
| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003 | (7) | | | (2) | (2) | (1) | | | | | (1) | |
| 2004 | | | | | (1) | | | | | | | |
|
From: <ben...@id...> - 2004-05-21 08:13:29
|
Dear Open Source developer

I am doing a research project on "Fun and Software Development" in which I kindly invite you to participate. You will find the online survey under http://fasd.ethz.ch/qsf/. The questionnaire consists of 53 questions and you will need about 15 minutes to complete it.

With the FASD project (Fun and Software Development) we want to define the motivational significance of fun when software developers decide to engage in Open Source projects. What is special about our research project is that a similar survey is planned with software developers in commercial firms. This procedure allows the immediate comparison between the involved individuals and the conditions of production of these two development models. Thus we hope to obtain substantial new insights to the phenomenon of Open Source Development.

With many thanks for your participation,
Benno Luthiger

PS: The results of the survey will be published under http://www.isu.unizh.ch/fuehrung/blprojects/FASD/. We have set up the mailing list fa...@we... for this study. Please see http://fasd.ethz.ch/qsf/mailinglist_en.html for registration to this mailing list.

_______________________________________________________________________
Benno Luthiger
Swiss Federal Institute of Technology Zurich
8092 Zurich
Mail: benno.luthiger(at)id.ethz.ch
_______________________________________________________________________
|
|
From: <kw...@us...> - 2003-11-12 08:03:35
|
Update of /cvsroot/tess/TheSpamSecretary
In directory sc8-pr-cvs1:/tmp/cvs-serv15539
Modified Files:
TheSpamSecretary.py
Log Message:
Tweak to display of database (--showbad, etc). Better handling of mime mail. Slight change to minimum spam value so as not to dump friend's email
Index: TheSpamSecretary.py
===================================================================
RCS file: /cvsroot/tess/TheSpamSecretary/TheSpamSecretary.py,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** TheSpamSecretary.py 18 Jun 2003 23:34:55 -0000 1.16
--- TheSpamSecretary.py 12 Nov 2003 08:03:07 -0000 1.17
***************
*** 46,49 ****
--- 46,50 ----
import user
+ import md5
import StringIO
import multifile
***************
*** 157,161 ****
return .4
if (bad_count == 0):
! return .0101
if (good_count == 0):
return .99
--- 158,162 ----
return .4
if (bad_count == 0):
! return .0089
if (good_count == 0):
return .99
***************
*** 255,259 ****
keys.sort()
for someKey in keys:
! print("%s: %s" % (someKey, someDict[someKey]))
print("Message count: %s" % someDict['TeSSFileCount'])
--- 256,263 ----
keys.sort()
for someKey in keys:
! try:
! print("%s: %s" % (someKey, someDict[someKey]))
! except KeyError:
! sys.stderr.write("Could not find the value for key %s\n" % someKey)
print("Message count: %s" % someDict['TeSSFileCount'])
***************
*** 497,502 ****
self.scanURLs(outputData.getvalue(), self.tempDict)
outputData = self.stripComments(outputData.getvalue())
! #else:
! # print("NO DECODE")
#print(outputData.getvalue())
self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
--- 501,516 ----
self.scanURLs(outputData.getvalue(), self.tempDict)
outputData = self.stripComments(outputData.getvalue())
! else:
! try:
! someData = StringIO.StringIO()
! hashCounter = md5.new()
! mimetools.decode(multiFile, someData, onePart.getencoding())
! hashCounter.update(someData.getvalue())
! if (self.debugFilter):
! sys.stdout.write("Hash value is: %s\n" % hashCounter.hexdigest())
! self.addTokensFromTextToDict(hashCounter.hexdigest(), self.tempDict, "MIMEFILEHASH:")
! except:
! sys.stdout.write("Bad multipart of type %s\n" % onePart.gettype())
!
#print(outputData.getvalue())
self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
|
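Editor's note: the last hunk above hashes a MIME part that is not tokenized as text and records the digest as a `MIMEFILEHASH:` token, so an attachment that recurs across messages can be recognized. A minimal Python 3 sketch of that idea follows. It is not the committed code: `hashlib` stands in for the old `md5` module, a `Counter` stands in for `self.tempDict`, and the function name is hypothetical.

```python
import hashlib
from collections import Counter


def add_attachment_hash_token(payload: bytes, tokens: Counter) -> str:
    """Count one MIMEFILEHASH:<digest> token for a decoded attachment."""
    # hashlib.md5 replaces md5.new() from the Python 2 standard library.
    digest = hashlib.md5(payload).hexdigest()
    token = "MIMEFILEHASH:" + digest
    tokens[token] += 1
    return token


tokens = Counter()
add_attachment_hash_token(b"same bytes", tokens)
add_attachment_hash_token(b"same bytes", tokens)   # identical attachment seen again
add_attachment_hash_token(b"other bytes", tokens)
```

One point that may be worth checking against the full file: going by the 1.16 diff, `addOneTokenToDict` drops any token longer than `MAX_WORD_LENGTH` (40), and `MIMEFILEHASH:` plus a 32-character MD5 hex digest comes to 45 characters.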
|
From: <kw...@us...> - 2003-06-18 23:34:58
|
Update of /cvsroot/tess/TheSpamSecretary
In directory sc8-pr-cvs1:/tmp/cvs-serv26617
Modified Files:
TheSpamSecretary.py
Log Message:
Better parsing of html mail. Added weight to URLs within html text. Made pure non-spam words worth less than pure-spam words.
Index: TheSpamSecretary.py
===================================================================
RCS file: /cvsroot/tess/TheSpamSecretary/TheSpamSecretary.py,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** TheSpamSecretary.py 5 May 2003 15:44:36 -0000 1.15
--- TheSpamSecretary.py 18 Jun 2003 23:34:55 -0000 1.16
***************
*** 77,81 ****
self.debugFilter = 0
self.MINIMUM_GOOD_MESSAGE_COUNT = 40
! self.MAX_WORD_LENGTH = 20
self.myWordDefinition = re.compile("[^\w\-'\$]+") ##" alphanumeric characters, dashes, apostrophes, and dollar signs"
--- 77,81 ----
self.debugFilter = 0
self.MINIMUM_GOOD_MESSAGE_COUNT = 40
! self.MAX_WORD_LENGTH = 40
self.myWordDefinition = re.compile("[^\w\-'\$]+") ##" alphanumeric characters, dashes, apostrophes, and dollar signs"
***************
*** 157,161 ****
return .4
if (bad_count == 0):
! return .01
if (good_count == 0):
return .99
--- 157,161 ----
return .4
if (bad_count == 0):
! return .0101
if (good_count == 0):
return .99
***************
*** 164,168 ****
else:
return 0.0
! return max(0.01, (min(0.99, returnValue)))
#print(self.tempDict)
--- 164,168 ----
else:
return 0.0
! return max(0.0101, (min(0.99, returnValue)))
#print(self.tempDict)
***************
*** 472,475 ****
--- 472,478 ----
self.logFile.write("Failed to decode something of type %s\n" % aMessage.getmaintype())
self.logFile.write("Failed for subject %s\n" % aMessage.getheader('Subject'))
+ if (re.search("html", aMessage.gettype())):
+ self.scanURLs(outputData.getvalue(), self.tempDict)
+ outputData = self.stripComments(outputData.getvalue())
#print("MSXXX:%s:MEXXX" % outputData.getvalue())
self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
***************
*** 484,488 ****
while multiFile.next():
onePart = mimetools.Message(multiFile)
! #print("TYPE: %s" % onePart.gettype())
if (not (re.search("application", onePart.gettype()) or re.search("image", onePart.gettype()))):
--- 487,491 ----
while multiFile.next():
onePart = mimetools.Message(multiFile)
! #sys.stderr.write("TYPE: %s" % onePart.gettype())
if (not (re.search("application", onePart.gettype()) or re.search("image", onePart.gettype()))):
***************
*** 492,496 ****
self.logFile.write("Failed to decode something of type %s\n" % onePart.getencoding())
if (re.search("html", onePart.gettype())):
! outputData = self.stripComments(outputData.getvalue());
#else:
# print("NO DECODE")
--- 495,500 ----
self.logFile.write("Failed to decode something of type %s\n" % onePart.getencoding())
if (re.search("html", onePart.gettype())):
! self.scanURLs(outputData.getvalue(), self.tempDict)
! outputData = self.stripComments(outputData.getvalue())
#else:
# print("NO DECODE")
***************
*** 498,501 ****
--- 502,509 ----
self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
else:
+ if ((aMessage.getheader("content-type") != None) and re.search("html", aMessage.getheader("content-type"))):
+ self.scanURLs(outputData.getvalue(), self.tempDict)
+ outputData = self.stripComments(outputData.getvalue())
+
self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
while (1):
***************
*** 506,509 ****
--- 514,520 ----
someFile.seek(lastPosition)
break
+ if ((aMessage.getheader("content-type") != None) and re.search("html", aMessage.getheader("content-type"))):
+ self.scanURLs(oneLine, self.tempDict)
+ oneLine = self.stripComments(oneLine).getvalue()
self.addTokensFromTextToDict(oneLine, self.tempDict)
#print("oneline: %s" % oneLine)
***************
*** 512,515 ****
--- 523,541 ----
##################################################
+ def scanURLs(self, someText, someDict):
+ """
+ Scan the html text and add any found hosts to someDict.
+ """
+ outputData = StringIO.StringIO()
+ HOSTNAMEDEF = "\w\-\.\_" # 0-9 a-z A-Z - . _
+ urls = re.findall("http://.*?([" + HOSTNAMEDEF + "]+)/", someText)
+ if ((self.debugFilter) and (len(urls) > 0)):
+ sys.stdout.write("Found url hosts: %s\n" % urls)
+
+ for one_word in urls:
+ self.addOneTokenToDict("URLHOST:" + one_word, someDict)
+
+ ##################################################
+
def stripComments(self, someText):
"""
***************
*** 534,551 ****
found_words = self.myWordDefinition.split(someText)
for one_word in found_words:
! #the word has to have at least one alpha
! if ((one_word == '') or (not self.myCharDefinition.search(one_word))):
! continue
! if (len(one_word) > self.MAX_WORD_LENGTH):
! continue
! one_word = textType + one_word
! #sys.stderr.write("One word: %s\n" % one_word)
! word_count = someDict.get(one_word)
! try:
! someInt = int(word_count) + 1
! except:
! someInt = 1
! someDict[one_word] = someInt
! #print(self.someDict)
##################################################
--- 560,586 ----
found_words = self.myWordDefinition.split(someText)
for one_word in found_words:
! self.addOneTokenToDict(textType + one_word, someDict)
!
! ##################################################
!
! def addOneTokenToDict(self, someWord, someDict):
! """
! Add a single chunk of text to the dict
! """
! if (someWord == None):
! return;
!
! #the word has to have at least one alpha
! if ((someWord == '') or (not self.myCharDefinition.search(someWord))):
! return
! if (len(someWord) > self.MAX_WORD_LENGTH):
! return
! #sys.stderr.write("One word: %s\n" % someWord)
! word_count = someDict.get(someWord)
! try:
! someInt = int(word_count) + 1
! except:
! someInt = 1
! someDict[someWord] = someInt
##################################################
|
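Editor's note: `scanURLs` in this commit pulls host names out of `http://` links and counts each one as a `URLHOST:` token. The sketch below shows the same idea in Python 3 under a few assumptions: the lazy `.*?` from the committed pattern is dropped so the host must directly follow `http://`, hosts are lowercased, and a `Counter` replaces the dict passed as `someDict`. The function name and sample text are illustrative only.

```python
import re
from collections import Counter

# Same character class as HOSTNAMEDEF in the diff: word characters, '-', '.', '_'.
HOST_RE = re.compile(r"http://([\w\-.]+)/")


def scan_url_hosts(html_text: str, tokens: Counter) -> list:
    """Add one URLHOST:<host> token per http:// link found in html_text."""
    hosts = [host.lower() for host in HOST_RE.findall(html_text)]
    for host in hosts:
        tokens["URLHOST:" + host] += 1
    return hosts


sample = (
    '<a href="http://www.example.com/buy">x</a> '
    '<img src="http://WWW.example.com/i.gif"> '
    'http://other.example.org/'
)
tokens = Counter()
found = scan_url_hosts(sample, tokens)
```

Like the committed pattern, this only matches URLs that have a `/` after the host, so a bare `http://example.com` with no trailing slash is not counted.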
|
From: <kw...@us...> - 2003-05-05 15:44:40
|
Update of /cvsroot/tess/TheSpamSecretary
In directory sc8-pr-cvs1:/tmp/cvs-serv16379
Modified Files:
TheSpamSecretary.py
Log Message:
Fixed a problem with empty subjects/bodies.
Index: TheSpamSecretary.py
===================================================================
RCS file: /cvsroot/tess/TheSpamSecretary/TheSpamSecretary.py,v
retrieving revision 1.14
retrieving revision 1.15
diff -C2 -d -r1.14 -r1.15
*** TheSpamSecretary.py 4 May 2003 20:33:58 -0000 1.14
--- TheSpamSecretary.py 5 May 2003 15:44:36 -0000 1.15
***************
*** 458,462 ****
outputData.write(aMessage)
#sys.stderr.write("Subject: %s\n" % aMessage.getheader('Subject'))
! self.addTokensFromTextToDict(aMessage.getheader('Subject'), self.tempDict, "SUBJECT:")
#print("MS:%s:ME" % outputData.getvalue())
#deal with mime messages
--- 458,463 ----
outputData.write(aMessage)
#sys.stderr.write("Subject: %s\n" % aMessage.getheader('Subject'))
! subject = aMessage.getheader('Subject')
! self.addTokensFromTextToDict(subject, self.tempDict, "SUBJECT:")
#print("MS:%s:ME" % outputData.getvalue())
#deal with mime messages
***************
*** 470,473 ****
--- 471,475 ----
except:
self.logFile.write("Failed to decode something of type %s\n" % aMessage.getmaintype())
+ self.logFile.write("Failed for subject %s\n" % aMessage.getheader('Subject'))
#print("MSXXX:%s:MEXXX" % outputData.getvalue())
self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
***************
*** 526,529 ****
--- 528,533 ----
textType is the type of text being added - '' for body text, SUBJECT: for subject text.
"""
+ if (someText == None):
+ return;
someText = someText.lower()
#print("scanning %s" % someText)
|
|
From: <kw...@us...> - 2003-05-04 20:34:01
|
Update of /cvsroot/tess/TheSpamSecretary
In directory sc8-pr-cvs1:/tmp/cvs-serv19652
Modified Files:
TheSpamSecretary.py
Log Message:
Added subject tagging, which will double-count subject words (and mark them as SUBJECT:word in the keyvalue dicts).
Index: TheSpamSecretary.py
===================================================================
RCS file: /cvsroot/tess/TheSpamSecretary/TheSpamSecretary.py,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** TheSpamSecretary.py 12 Apr 2003 17:33:38 -0000 1.13
--- TheSpamSecretary.py 4 May 2003 20:33:58 -0000 1.14
***************
*** 457,460 ****
--- 457,462 ----
outputData = StringIO.StringIO()
outputData.write(aMessage)
+ #sys.stderr.write("Subject: %s\n" % aMessage.getheader('Subject'))
+ self.addTokensFromTextToDict(aMessage.getheader('Subject'), self.tempDict, "SUBJECT:")
#print("MS:%s:ME" % outputData.getvalue())
#deal with mime messages
***************
*** 519,525 ****
##################################################
! def addTokensFromTextToDict(self, someText, someDict):
"""
! Find all the tokens in the text and add them to the given dict
"""
someText = someText.lower()
--- 521,528 ----
##################################################
! def addTokensFromTextToDict(self, someText, someDict, textType = ''):
"""
! Find all the tokens in the text and add them to the given dict.
! textType is the type of text being added - '' for body text, SUBJECT: for subject text.
"""
someText = someText.lower()
***************
*** 532,535 ****
--- 535,540 ----
if (len(one_word) > self.MAX_WORD_LENGTH):
continue
+ one_word = textType + one_word
+ #sys.stderr.write("One word: %s\n" % one_word)
word_count = someDict.get(one_word)
try:
***************
*** 634,638 ****
interestValue = .5 - one_prob
interestValue *= 2.0
! #print("%s %s %s" % (one_key, interestValue, one_prob))
if ((interestValue > leastInteresting) or (len(interestingListValues) < 15)):
#INSERT SORT - FIX ME - will sorting be a win?
--- 639,644 ----
interestValue = .5 - one_prob
interestValue *= 2.0
! if (self.debugFilter):
! print("%s %s" % (one_key, one_prob))
if ((interestValue > leastInteresting) or (len(interestingListValues) < 15)):
#INSERT SORT - FIX ME - will sorting be a win?
|
|
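Editor's note: the subject tagging added here can be sketched as a tokenizer that takes an optional prefix. The following Python 3 version is an illustration, not the committed method. The split pattern is copied from the 1.16 diff, `HAS_ALPHA` is an assumed stand-in for `myCharDefinition` (whose definition does not appear in these diffs), the length limit is the pre-1.16 value of 20, and the `None` guard mirrors the fix from revision 1.15.

```python
import re
from collections import Counter

WORD_SPLIT = re.compile(r"[^\w\-'$]+")  # separator pattern shown in the 1.16 diff
HAS_ALPHA = re.compile(r"[a-z]")         # assumption: "at least one alpha" check
MAX_WORD_LENGTH = 20


def add_tokens(text, tokens, text_type=""):
    """Count every word in text, prefixing each token with text_type."""
    if text is None:  # a message without a Subject header
        return
    for word in WORD_SPLIT.split(text.lower()):
        if not word or not HAS_ALPHA.search(word) or len(word) > MAX_WORD_LENGTH:
            continue
        tokens[text_type + word] += 1


tokens = Counter()
add_tokens("Cheap Pills", tokens, "SUBJECT:")                   # tagged subject pass
add_tokens("Subject: Cheap Pills\n\nBuy cheap pills", tokens)   # whole-message pass
add_tokens(None, tokens, "SUBJECT:")                            # no subject: no-op
```

Because the whole message, headers included, is also tokenized without a prefix, each subject word ends up counted once as `SUBJECT:word` and once as a plain word, which is the double counting the log message describes.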
From: <kw...@us...> - 2003-04-12 17:33:43
|
Update of /cvsroot/tess/TheSpamSecretary
In directory sc8-pr-cvs1:/tmp/cvs-serv5901
Modified Files:
TheSpamSecretary.py
Log Message:
Removed old references to cPickle.
Index: TheSpamSecretary.py
===================================================================
RCS file: /cvsroot/tess/TheSpamSecretary/TheSpamSecretary.py,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** TheSpamSecretary.py 1 Apr 2003 05:02:36 -0000 1.12
--- TheSpamSecretary.py 12 Apr 2003 17:33:38 -0000 1.13
***************
*** 326,334 ****
self.goodDict = anydbm.open(self.goodDictPath, "c") # myUnpickler.load()
self.goodMessageCount = int(self.goodDict['TeSSFileCount']) # myUnpickler.load())
- except cPickle.PickleError:
- print("failed to load good dict from %s" % self.goodDictPath)
- self.logFile.write("PickleError failed to load good dict from %s\n" % self.goodDictPath)
- self.dictFailure = 1
- self.goodDict = {}
except (IOError, EOFError):
self.logFile.write("%s %s failed to load good dict from %s\n" % (sys.exc_type, sys.exc_value, self.goodDictPath))
--- 326,329 ----
***************
*** 343,350 ****
self.badDict = anydbm.open(self.badDictPath, "c") # myUnpickler.load()
self.badMessageCount = int(self.badDict['TeSSFileCount']) # myUnpickler.load())
- except cPickle.PickleError:
- self.logFile.write("PickleError failed to load bad dict from %s\n" % self.badDictPath)
- self.dictFailure = 1
- self.badDict = {}
except (IOError, EOFError):
self.logFile.write("%s %s failed to load bad dict from %s\n" % (sys.exc_type, sys.exc_value, self.badDictPath))
--- 338,341 ----
|
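Editor's note: with the pickle handlers gone, both dictionaries are plain `anydbm` files holding counts as strings, plus a `TeSSFileCount` entry. In Python 3 `anydbm` became the `dbm` package. The sketch below shows the same storage pattern under stated assumptions: `dbm.dumb` is chosen only because it is always available, the file lives in a temporary directory rather than at the real dictionary path, and `bump_count` is a hypothetical helper.

```python
import dbm.dumb
import os
import tempfile


def bump_count(db, key, amount=1):
    """Increment an integer stored as text in a dbm-style mapping."""
    try:
        current = int(db[key])  # values come back as bytes; int() accepts them
    except KeyError:
        current = 0
    db[key] = str(current + amount)
    return current + amount


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gooddict")
    with dbm.dumb.open(path, "c") as db:   # "c": create the file if it is missing
        bump_count(db, "TeSSFileCount")
        bump_count(db, "meeting")
    with dbm.dumb.open(path, "c") as db:   # reopen to show the count persisted
        reopened_count = bump_count(db, "meeting")
```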
|
From: <kw...@us...> - 2003-04-01 05:02:40
|
Update of /cvsroot/tess/TheSpamSecretary
In directory sc8-pr-cvs1:/tmp/cvs-serv16016
Modified Files:
TheSpamSecretary.py
Log Message:
Strip comments out of multipart html.
Index: TheSpamSecretary.py
===================================================================
RCS file: /cvsroot/tess/TheSpamSecretary/TheSpamSecretary.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** TheSpamSecretary.py 13 Jan 2003 00:15:49 -0000 1.11
--- TheSpamSecretary.py 1 Apr 2003 05:02:36 -0000 1.12
***************
*** 491,498 ****
--- 491,501 ----
#print("TYPE: %s" % onePart.gettype())
if (not (re.search("application", onePart.gettype()) or re.search("image", onePart.gettype()))):
+
try:
mimetools.decode(multiFile, outputData, onePart.getencoding())
except:#
self.logFile.write("Failed to decode something of type %s\n" % onePart.getencoding())
+ if (re.search("html", onePart.gettype())):
+ outputData = self.stripComments(outputData.getvalue());
#else:
# print("NO DECODE")
***************
*** 511,514 ****
--- 514,528 ----
#print("oneline: %s" % oneLine)
return(count)
+
+ ##################################################
+
+ def stripComments(self, someText):
+ """
+ Strip the comments from an html mime part. Returns a StringIO.
+ """
+ outputData = StringIO.StringIO()
+ commentRE = re.compile("<!--.*?-->", re.DOTALL | re.MULTILINE)
+ outputData.write(commentRE.sub('', someText))
+ return outputData
##################################################
|
|
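Editor's note: the new `stripComments` removes HTML comments before tokenizing, which helps when a comment is placed in the middle of a word to hide it from a word-based filter. A minimal Python 3 sketch of the same substitution follows. Unlike the committed method it returns a plain string rather than a `StringIO`, and it omits `re.MULTILINE`, which has no effect on a pattern without `^` or `$`.

```python
import re

# DOTALL lets '.' cross newlines so multi-line comments are removed too.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html_text: str) -> str:
    """Return html_text with every <!-- ... --> comment removed."""
    return COMMENT_RE.sub("", html_text)


cleaned = strip_comments("vi<!-- filler\ntext -->agra <b>now</b>")
```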
From: <kw...@us...> - 2003-01-23 06:20:31
|
Update of /cvsroot/tess/homepage
In directory sc8-pr-cvs1:/tmp/cvs-serv18998/homepage
Modified Files:
index.html
Log Message:
Minor homepage updates.
Index: index.html
===================================================================
RCS file: /cvsroot/tess/homepage/index.html,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** index.html 7 Jan 2003 04:54:41 -0000 1.4
--- index.html 23 Jan 2003 06:20:28 -0000 1.5
***************
*** 33,37 ****
<a name="overview"> The Spam Secretary is an anti-spam mail filter based on the article at <a href="http://www.paulgraham.com/spam.html" target=_blank>http://www.paulgraham.com/spam.html</a>. It is a single file written in python and is easy to install. It is mailclient independent, and the hope is that it will be MDA independent as well. As long as you can specify a program to run when you deliver mail, have python installed, and your mail is stored in mbox format (maildir format forthcoming), you're all set.</a>
<p>
! Please look <a href="http://www.sourceforge.com/projects/tess">here for files and documentations.</a>
<p>
<a name="works">In breif, the filter works by keeping track of how often a word appears in smap mail vs. non-spam mail. If a word often appears in spam, but only seldom appears in non-spam, the odds of that message being spam are greater. In this way, the 15 "most interesting" (most and least spam-probable) words are found in an incoming message and the message is sorted accordingly. This sounds so obvious and easy that it's hard to believe it really works, but it does.</a>
--- 33,37 ----
<a name="overview"> The Spam Secretary is an anti-spam mail filter based on the article at <a href="http://www.paulgraham.com/spam.html" target=_blank>http://www.paulgraham.com/spam.html</a>. It is a single file written in python and is easy to install. It is mailclient independent, and the hope is that it will be MDA independent as well. As long as you can specify a program to run when you deliver mail, have python installed, and your mail is stored in mbox format (maildir format forthcoming), you're all set.</a>
<p>
! Please look <a href="http://www.sourceforge.com/projects/tess">here for files and documentation.</a>
<p>
<a name="works">In breif, the filter works by keeping track of how often a word appears in smap mail vs. non-spam mail. If a word often appears in spam, but only seldom appears in non-spam, the odds of that message being spam are greater. In this way, the 15 "most interesting" (most and least spam-probable) words are found in an incoming message and the message is sorted accordingly. This sounds so obvious and easy that it's hard to believe it really works, but it does.</a>
|
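Editor's note: the overview text in this diff says the filter picks the 15 "most interesting" words of a message and sorts the message accordingly. The sketch below shows how that selection and the combining rule from the Paul Graham article linked above can be written in Python 3. It is an illustration under assumptions: TeSS's own combining code does not appear in these diffs, the function name is hypothetical, and the per-word probabilities are taken to be already clamped strictly between 0 and 1 (the diffs elsewhere in this archive show values such as .0089, .4 and .99).

```python
def combined_spam_probability(word_probs, keep=15):
    """Combine per-word spam probabilities using the most interesting words."""
    # "Interesting" means far from the neutral value 0.5, in either direction.
    interesting = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = 1.0
    ham = 1.0
    for p in interesting:
        spam *= p
        ham *= 1.0 - p
    # Graham's rule: prod(p) / (prod(p) + prod(1 - p))
    return spam / (spam + ham)


spammy = combined_spam_probability([0.99, 0.99, 0.4])
neutral = combined_spam_probability([0.5, 0.5])
hammy = combined_spam_probability([0.01, 0.0089, 0.4])
```

A message would then be delivered to the spam box when the combined value exceeds some threshold; the threshold TeSS uses is not stated in this archive.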
|
From: <kw...@us...> - 2003-01-13 00:15:53
|
Update of /cvsroot/tess/TheSpamSecretary
In directory sc8-pr-cvs1:/tmp/cvs-serv26414
Modified Files:
TheSpamSecretary.py
Log Message:
Decode messages that are not multipart, but are base64 encoded.
Index: TheSpamSecretary.py
===================================================================
RCS file: /cvsroot/tess/TheSpamSecretary/TheSpamSecretary.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** TheSpamSecretary.py 7 Jan 2003 04:55:19 -0000 1.10
--- TheSpamSecretary.py 13 Jan 2003 00:15:49 -0000 1.11
***************
*** 36,40 ****
import getopt
import ConfigParser
! import cPickle
import anydbm
import fcntl
--- 36,40 ----
import getopt
import ConfigParser
! #import cPickle
import anydbm
import fcntl
***************
*** 466,498 ****
outputData = StringIO.StringIO()
outputData.write(aMessage)
#deal with mime messages
#this should take care of most non-english messages, too
! if (re.match("multipart", aMessage.gettype())):
! #print(aMessage.gettype())
! multiFile = multifile.MultiFile(someFile)
! multiFile.push(aMessage.getparam("boundary"))
! while multiFile.next():
! onePart = mimetools.Message(multiFile)
! #print("TYPE: %s" % onePart.gettype())
! if (not (re.search("application", onePart.gettype()) or re.search("image", onePart.gettype()))):
! try:
! mimetools.decode(multiFile, outputData, onePart.getencoding())
! except:
! self.logFile.write("Failed to decode something of type %s\n" % onePart.getencoding())
! #else:
! # print("NO DECODE")
! #print(outputData.getvalue())
! self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
else:
! self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
! while (1):
! lastPosition = someFile.tell()
! oneLine = someFile.readline()
! if ((oneLine == '') or (self.fromLine.match(oneLine))):
! #if (aMessage.islast(oneLine)):
! someFile.seek(lastPosition)
! break
! self.addTokensFromTextToDict(oneLine, self.tempDict)
! #print("oneline: %s" % oneLine)
return(count)
--- 466,513 ----
outputData = StringIO.StringIO()
outputData.write(aMessage)
+ #print("MS:%s:ME" % outputData.getvalue())
#deal with mime messages
#this should take care of most non-english messages, too
! if (re.match("base64", aMessage.getencoding())):
! if (re.match("text", aMessage.getmaintype())):
! #sys.stderr.write("%s\n" % aMessage.getencoding())
! #sys.stderr.write("%s\n" % aMessage.getmaintype()) # "text"
! try:
! mimetools.decode(someFile, outputData, "base64")
! except:
! self.logFile.write("Failed to decode something of type %s\n" % aMessage.getmaintype())
! #print("MSXXX:%s:MEXXX" % outputData.getvalue())
! self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
! #print(aMessage.get_payload(None, 1))
! else:
! self.logFile.write("Not decoding something of type %s\n" % aMessage.getmaintype())
else:
! if (re.match("multipart", aMessage.gettype())):
! #print(aMessage.gettype())
! multiFile = multifile.MultiFile(someFile)
! multiFile.push(aMessage.getparam("boundary"))
! while multiFile.next():
! onePart = mimetools.Message(multiFile)
! #print("TYPE: %s" % onePart.gettype())
! if (not (re.search("application", onePart.gettype()) or re.search("image", onePart.gettype()))):
! try:
! mimetools.decode(multiFile, outputData, onePart.getencoding())
! except:#
! self.logFile.write("Failed to decode something of type %s\n" % onePart.getencoding())
! #else:
! # print("NO DECODE")
! #print(outputData.getvalue())
! self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
! else:
! self.addTokensFromTextToDict(outputData.getvalue(), self.tempDict)
! while (1):
! lastPosition = someFile.tell()
! oneLine = someFile.readline()
! if ((oneLine == '') or (self.fromLine.match(oneLine))):
! #if (aMessage.islast(oneLine)):
! someFile.seek(lastPosition)
! break
! self.addTokensFromTextToDict(oneLine, self.tempDict)
! #print("oneline: %s" % oneLine)
return(count)
|
|
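Editor's note: this commit decodes single-part messages whose body is base64 encoded text before tokenizing them. `mimetools` and `multifile` no longer exist in Python 3, so the sketch below shows the equivalent step with the standard `email` package. It is an assumption-laden illustration rather than a port: the function name and sample message are made up, and undecodable bytes are replaced instead of logged.

```python
import base64
import email


def body_text(raw_message: bytes) -> str:
    """Return the decoded body of a single-part text message, else ''."""
    msg = email.message_from_bytes(raw_message)
    if msg.is_multipart() or msg.get_content_maintype() != "text":
        return ""  # mirrors the "Not decoding something of type ..." branch
    # decode=True undoes the Content-Transfer-Encoding (base64 here).
    payload = msg.get_payload(decode=True)
    charset = msg.get_content_charset() or "ascii"
    return payload.decode(charset, errors="replace")


raw = (
    b"Subject: hello\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + base64.b64encode(b"Buy cheap pills now\n") + b"\r\n"
)
decoded = body_text(raw)
```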
From: <kw...@us...> - 2003-01-08 05:34:02
|
Update of /cvsroot/tess/TheSpamSecretary
In directory sc8-pr-cvs1:/tmp/cvs-serv23739
Modified Files:
README
Added Files:
COOKBOOK.html
Removed Files:
COOKBOOK
Log Message:
Wrote the COOKBOOK.html.
--- NEW FILE: COOKBOOK.html ---
<html>
<body>
<title>
TheSpamSecretary Cookbook
</title>
<center>
<h3>
TheSpamSecretary Cookbook
</h3>
</center>
So, you don't delete your read email? <a href="#procmail">Or you use procmail</a>, or some other MDA that expects filter agents to pipe to stdout and return an error condition? Or maybe <a href="#maildir">you use maildir?</a><br>
<br>
No problem.<br>
<pre>TheSpamSecretary.py --help</pre>
will display the command options available. Most commands may be run in one invocation, like the "standard forward" of<br>
<pre>"|TheSpamSecretary.py --filter --addgood=<path to deleted mail> --addbad=<path to deleted spam> --delete=1"
</pre>
Which will filter the email coming in from standard in and deliver it to either the inbox or the spam box. It will also parse the contents of the deleted mail and deleted spam boxes and truncate those files.<br>
<br>
But you're here because that doesn't cover your use case.<br>
<br>
For the purposes of TheSpamSecretary, mail usage falls into 4 categories:
<ol>
<li><a href="#normal">You usually read and delete your mail as you're done with it.</a>
<li><a href="#hoard">You NEVER delete ANY mail.</a>
<li><a href="#save">You ONLY delete spam.</a>
<li><a href="#mixed">You always delete spam and delete some mail.</a>
</ol>
<a name="normal"><b>1. You delete mail.</b></a><br>
This is the basic use case. Just let TheSpamSecretary consume your delete boxes and you're set. <br>
<br>
<a name="hoard"><b>2. You hoard ALL mail.</b></a><br>
It is important that you never delete any mail; if you do delete some, skip to #4. If you delete valid mail, you may start to mislabel some of its words as ones that are generally only found in spam. This is often the case if you subscribe to commercial announcements from only a few companies. These often look a lot like spam, and if you delete them, you risk not recognizing their words as valid. If this matches your use case, you're set - as long as you sort out your spam. <br>
<br>
Your forward should look like this: <br>
<pre>"|TheSpamSecretary.py --filter"
</pre>
On a regular basis you should update your dictionaries with a cronjob. What you need to do is delete and regenerate them. Something like: <br>
<pre>rm ~/.TheSpamSecretary.gooddict; TheSpamSecretary.py --addgood=<path to good box> --delete=0
rm ~/.TheSpamSecretary.baddict; TheSpamSecretary.py --addbad=<path to bad box> --delete=0
</pre>
Note that the good box path can be to a directory, in which case ALL the contents of that directory will be parsed (recursively). Note also that you can run multiple adds if you want to specify multiple boxes but have other files in your mailbox directories. Something like:
<pre>find ~/mail/notspam/ -name "*.mbox" -exec TheSpamSecretary.py --addgood={} --delete=0 \;
</pre>
<a name="save"><b>3. You keep ALL good mail and delete SPAM</b></a><br>
Review #2 - the Mail hoarder - for the warning about NOT discarding ANY valid email. If this suits you, you have a couple of options: <br>
<ol TYPE=a>
<li>Do the same as #2, but truncate the spam file
<li>Constantly truncate/update spam and update the good dict as in #2
</ol>
In either case, you will be updating your good dict on a regular basis. Check out #2, but in general something like a cron job doing<br>
<pre>
rm ~/.TheSpamSecretary.gooddict; TheSpamSecretary.py --addgood=<path to good box> --delete=0
</pre>
For case a.: <br>
You will be updating your spam dict on a regular basis, but you will be truncating that file. Something like a cron job doing: <br>
<pre>
TheSpamSecretary.py --addbad=<path to bad box> --delete=1
</pre>
NOTE that you are NOT deleting the existing baddict, but that you ARE truncating the spam mailbox. <br>
For case b.: <br>
Your forward will look something like this: <br>
<pre>
"|TheSpamSecretary.py --filter --addbad=<path to bad box> --delete=1"
</pre>
This will add words from your VerifiedSpam box and truncate it every time you receive mail. You will also have to update your gooddict as noted above (on a regular basis). <br>
<br>
<a name="mixed"><b>4. You keep most mail</b></a><br>
This is the trickiest to define, as it depends a lot on how you handle your read mail. Assuming you discard spam, you should probably use the forward described in case #3, above: <br>
<pre>
"|TheSpamSecretary.py --filter --addbad=<path to bad box> --delete=1"
</pre>
Maintaining your gooddict is trickier. If you delete some mail, but move most of your read email to a "Read" box, you could go a couple of routes. If you have already accumulated a lot of email, you could just use the standard route of truncating both your good and bad deleted mbox's, and when you start, manually add all your current good email using <br>
<pre>TheSpamSecretary.py --addgood=<path to good box> --delete=0
</pre>
For each of your good boxes. This should probably be good enough to keep TeSS running smoothly indefinitely. If you're willing to do a little more work to keep up with valid email that has been moved to your 'Read' box, you could do something like <br>
<pre>cat <path to Read box> >> <path to Read.parsed>
TheSpamSecretary.py --addgood=<path to Read box> --delete=1
</pre>
This will append your Read box to a Read.parsed box, then TheSpamSecretary will truncate your Read box. Note that using the cat command may not work for all mailbox formats/imap server/whatever - you should certainly test this before truncating your Read box!<br>
<br>
<center>
<font size=+2> <a name="procmail">So you use Procmail (or some other chaining MDA)</a> </font>
</center>
<br>
If you set your inboxpath in your .TheSpamSecretary.config file to -
<pre>inboxpath = -</pre>
Non spam mail will be piped to stdout instead of delivered to a file. If you set your spamboxpath in your .TheSpamSecretary.config file to -
<pre>spamboxpath = -</pre>
Spam mail will be piped to stdout as well, AND TheSpamSecretary will return an exit code of 1, as opposed to 0 for non-spam. If you can't allow TheSpamSecretary to exit with code 1, I recommend you specify a tempfile for spam delivery and rm -f the file before TheSpamSecretary and cat it after TheSpamSecretary is done.<br>
<br>
<center>
<font size=+2> <a name="maildir">So you use maildir format</a> </font>
</center>
<br>
Hrm. Well, I haven't used maildir since around 1988, though I loved the format. It turns out that the clients I use, and the imapd server I use do not. Yes, I know they are available.<br>
Here's the good news: TheSpamSecretary should work fine with maildir. Here's the bad news: I don't know how it works, exactly. Specifically, I don't know how maildir MDA's work. I assume they are like procmail in that they chain through stdout. You should read the Procmail notes if that is the case. I don't know how you deliver spam to a different maildir than non-spam, either. I'm hoping that the exit status will allow you to do that. I would appreciate feedback from ANYONE willing to test this stuff out!<br>
More good news: TheSpamSecretary handles maildir directories fine, from a reading perspective. If you specify a directory in your TheSpamSecretary commands:<br>
<pre>"|TheSpamSecretary.py --filter --addgood=/home/YOURNAME/mail/Deleted/ --addbad=/home/YOURNAME/mail/VerifiedSpam/ --delete=1"
</pre>
TheSpamSecretary will parse all the files (recursively) in those directories.<br>
<br>
<font color=red>WARNING!!!</font><br>
<b>--delete=1 WILL RECURSIVELY DELETE ALL FILES IN MAILDIR MODE.</b><br>
<font color=red>WARNING!!!</font><br>
<br>
That is, all the files in /home/YOURNAME/mail/Deleted/ and /home/YOURNAME/mail/VerifiedSpam/ <b>WILL BE DELETED. RECURSIVELY.</b> If you do something like soft link your home directory into your Deleted mail directory, TheSpamSecretary will delete all your files. ALL YOUR FILES. Hell, if you soft link / to your Deleted mail directory, TheSpamSecretary will delete YOUR WHOLE SYSTEM (as much as it can, and if your MDA runs as root for some reason (misconfigured, etc), it will DELETE YOUR WHOLE SYSTEM). Is that clear? DELETE YOUR FILES. It MAY BE that you would like to use --delete=0 and clean up the files to be deleted some other way. You should look at the various recipes above for notes on how you could manage your Deleted and VerifiedSpam directories. Probably you want to write a cron script that parses those directories and then deletes them using some maildir command that is smart about not DELETING ALL YOUR FILES because of a symlink. </pre>
</body>
</html>
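A minimal sketch of the kind of cleanup script the cookbook text above suggests running with --delete=0: it removes regular files under a mail directory but never follows a symlink. The function name remove_maildir_files is hypothetical and not part of TheSpamSecretary, and the sketch is written for a current Python rather than the 2.2 the project targeted.

```python
import os


def remove_maildir_files(root):
    """Delete regular files under root without following any symlink.

    Hypothetical helper, not part of TheSpamSecretary.
    Returns the list of paths that were removed.
    """
    # A trailing slash would make the link check below resolve the link,
    # so strip it first (but keep "/" itself intact).
    root = root.rstrip(os.sep) or os.sep
    if os.path.islink(root):
        # A linked top-level directory could point anywhere; refuse outright.
        raise ValueError("refusing to clean a symlinked directory: %s" % root)
    removed = []
    # followlinks=False keeps os.walk from descending into linked directories.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave links alone; never touch their targets
            os.remove(path)
            removed.append(path)
    return removed
```

Run against /home/YOURNAME/mail/Deleted/ after the directory has been parsed, this would empty the mailbox while leaving anything reachable only through a soft link untouched.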
Index: README
===================================================================
RCS file: /cvsroot/tess/TheSpamSecretary/README,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** README 7 Jan 2003 04:55:19 -0000 1.1
--- README 8 Jan 2003 05:33:59 -0000 1.2
***************
*** 7,11 ****
every time you receive new mail!
! If this does not match your usage pattern, check out the COOKBOOK
To install if you don't use procmail!!!:
--- 7,11 ----
every time you receive new mail!
! If this does not match your usage pattern, check out the COOKBOOK.html
To install if you don't use procmail!!!:
***************
*** 48,52 ****
again and change --delete=0 to --delete=1 . THIS WILL CONSUME YOUR Deleted
MAILBOX AND VerifiedSpam MAILBOX EVERY TIME YOU RECEIVE MAIL. If this is not
! the behavior you want, please check out the COOKBOOK.
If you did not have a store of deleted mail to be consumed, I recommend copying
--- 48,52 ----
again and change --delete=0 to --delete=1 . THIS WILL CONSUME YOUR Deleted
MAILBOX AND VerifiedSpam MAILBOX EVERY TIME YOU RECEIVE MAIL. If this is not
! the behavior you want, please check out the COOKBOOK.html.
If you did not have a store of deleted mail to be consumed, I recommend copying
--- COOKBOOK DELETED ---
|
|
From: <kw...@us...> - 2003-01-07 04:55:22
|
Update of /cvsroot/tess/TheSpamSecretary
In directory sc8-pr-cvs1:/tmp/cvs-serv1858
Modified Files:
TheSpamSecretary.py
Added Files:
COOKBOOK README
Log Message:
Added exit status, - goes to stdout for procmail and friends.
--- NEW FILE: COOKBOOK ---
Full of good recipes! (to come)
--- NEW FILE: README ---
This assumes you use mbox format and do not use procmail for mail delivery.
It also assumes the following email usage pattern:
Incoming mail goes to your inbox, or spam box depending on status.
You move read mail to a Deleted box.
You move verified spam to a VerifiedSpam box.
You want both your Deleted box and VerifiedSpam box truncated to length 0
every time you receive new mail!
If this does not match your usage pattern, check out the COOKBOOK
To install if you don't use procmail!!!:
(as root)
cp TheSpamSecretary.py /usr/local/bin [or wherever you like]
(as you)
python /usr/local/bin/TheSpamSecretary.py
This will generate the needed .TheSpamSecretary... files in your home directory.
Then you must edit ~/.TheSpamSecretary.config
MAKE SURE that your inboxpath is the correct path to your mail inbox.
The same goes for your spam box - make sure it exists and is correct.
Then edit (don't forget to keep a backup if you already have one)
~/.forward
It should look like this (include the "'s):
"|/<pathto>/TheSpamSecretary.py --filter --addbad=<path to your verified spam box> --addgood=<path to your deleted good mail box> --delete=0"
This will filter the incoming message and also add the contents of the bad box to the bad dictionary, and the good box to the good dictionary. It will not truncate the files when it is done.
TEST THIS BY SENDING YOURSELF MAIL.
If it arrives in your inbox, things are going well. If it doesn't, you can check the cookbook and try to debug it, or ask for help on the user list at http://lists.sourceforge.net/mailman/listinfo/tess-users. Otherwise, you should remove/revert your .forward file.
If this worked, you're well on your way!
If you had deleted messages in your Deleted box, you can make sure they were loaded by doing something like TheSpamSecretary.py --showgood | more Similarly, you can try TheSpamSecretary.py --showbad | more to check any spam messages. Once you're convinced incoming mail is working, you should edit your .forward again and change --delete=0 to --delete=1 . THIS WILL CONSUME YOUR Deleted MAILBOX AND VerifiedSpam MAILBOX EVERY TIME YOU RECEIVE MAIL. If this is not the behavior you want, please check out the COOKBOOK. If you did not have a store of deleted mail to be consumed, I recommend copying (make sure it is COPY and not move) your inbox to your deleted box. That way TeSS gets an idea of what you think of as not spam. TeSS will not filter mail until it has parsed at least 40 GOOD messages. Otherwise it tends to get a lot of false positives early on. After that, it seems to be smooth sailing! Index: TheSpamSecretary.py =================================================================== RCS file: /cvsroot/tess/TheSpamSecretary/TheSpamSecretary.py,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** TheSpamSecretary.py 5 Jan 2003 21:21:10 -0000 1.9 --- TheSpamSecretary.py 7 Jan 2003 04:55:19 -0000 1.10 *************** *** 32,37 **** #-------------------------------------------------------------------------- - #import mailbox import array import getopt import ConfigParser --- 32,37 ---- #-------------------------------------------------------------------------- import array + import atexit import getopt import ConfigParser *************** *** 67,70 **** --- 67,71 ---- Generate the filter dict. Proceed according to the commands. """ + self.exitStatus = 0 self.goodDict = {} self.goodMessageCount = 0 *************** *** 91,102 **** username = 'YOUR_NAME' self.inboxPath = os.path.join("/var/mail", username) ! self.deleteBoxPath = os.path.join(user.home, "mail/Deleted") self.spamBoxPath = os.path.join(user.home, "mail/Spam") ! 
self.verifiedspamBoxPath = os.path.join(user.home, "mail/VerifiedSpam") self.newGoodPath = None self.newBadPath = None self.filter = 0 self.loadedMessageCount = 0 self.__loadConfig() self.logFile = open(os.path.join(self.home, "TheSpamSecretary.log"), "a") --- 92,106 ---- username = 'YOUR_NAME' self.inboxPath = os.path.join("/var/mail", username) ! #self.deleteBoxPath = os.path.join(user.home, "mail/Deleted") self.spamBoxPath = os.path.join(user.home, "mail/Spam") ! #self.verifiedspamBoxPath = os.path.join(user.home, "mail/VerifiedSpam") self.newGoodPath = None self.newBadPath = None self.filter = 0 self.loadedMessageCount = 0 + + atexit.register(self.exitstatus) + # Start initialization from loading files self.__loadConfig() self.logFile = open(os.path.join(self.home, "TheSpamSecretary.log"), "a") *************** *** 105,108 **** --- 109,113 ---- self.__loadDictionaries() + # Start actually doing things if (self.newGoodPath and not self.dictFailure): self.addToDict(self.goodDict, self.goodMessageCount, self.goodDictPath, self.newGoodPath) *************** *** 169,173 **** """ try: ! (opts, args) = getopt.getopt(cmd[1:], "", ["addgood=", "addbad=", "config=", "delete=", "filter", "showgood", "showbad", "showfilter", "debugFilter", "license"]) except: sys.stderr.write(""" --- 174,178 ---- """ try: ! 
(opts, args) = getopt.getopt(cmd[1:], "", ["addgood=", "addbad=", "config=", "delete=", "filter", "showgood", "showbad", "debugFilter", "license"]) except: sys.stderr.write(""" *************** *** 192,196 **** --showgood will dump the good dict to stdout and exit --showbad will dump the bad dict to stdout and exit - --showfilter will dump the filter dict to stdout and exit --debugFilter will scan text from Standard Input and print the spam match dict --- 197,200 ---- *************** *** 221,232 **** self.__loadDictionaries() self.__showDict(self.badDict) - if (o == "--showfilter"): - self.__loadConfig() - self.__loadDictionaries() - # self.__generateFilterDict() - # print(self.filterDict) - print("Good messages read: %d" % self.goodMessageCount) - print("Bad messages read: %d" % self.badMessageCount) - if (o == "--debugFilter"): self.__loadConfig() --- 225,228 ---- *************** *** 287,291 **** needConfig = 0 if (os.access(self.configPath, os.R_OK) == 0): ! sys.stderr.write("ERROR: %s does not exist or is unreadable. Exiting...\n" % self.configPath) needConfig = 1 --- 283,288 ---- needConfig = 0 if (os.access(self.configPath, os.R_OK) == 0): ! sys.stderr.write("ERROR: %s does not exist or is unreadable.\n" % self.configPath) ! sys.stderr.write(" Writing a new one. Please make sure it is correct.\n") needConfig = 1 *************** *** 306,312 **** self.badDictPath = self.__getDefault(configparser, "badDictPath", defaultValue = self.badDictPath) self.inboxPath = self.__getDefault(configparser, "inboxPath", defaultValue = self.inboxPath) ! self.deleteBoxPath = self.__getDefault(configparser, "deleteBoxPath", defaultValue = self.deleteBoxPath) self.spamBoxPath = self.__getDefault(configparser, "spamBoxPath", defaultValue = self.spamBoxPath) ! 
self.verifiedspamBoxPath = self.__getDefault(configparser, "verifiedspamBoxPath", defaultValue = self.verifiedspamBoxPath) self.deleteFiles = self.__getDefault(configparser, "deleteFiles", defaultValue = self.deleteFiles) try: --- 303,309 ---- self.badDictPath = self.__getDefault(configparser, "badDictPath", defaultValue = self.badDictPath) self.inboxPath = self.__getDefault(configparser, "inboxPath", defaultValue = self.inboxPath) ! #self.deleteBoxPath = self.__getDefault(configparser, "deleteBoxPath", defaultValue = self.deleteBoxPath) self.spamBoxPath = self.__getDefault(configparser, "spamBoxPath", defaultValue = self.spamBoxPath) ! #self.verifiedspamBoxPath = self.__getDefault(configparser, "verifiedspamBoxPath", defaultValue = self.verifiedspamBoxPath) self.deleteFiles = self.__getDefault(configparser, "deleteFiles", defaultValue = self.deleteFiles) try: *************** *** 384,394 **** #sys.stdin.seek(0) #tempFile = sys.stdin ! if (final_odds < .9): #print("looks clean") ! destFile = open(self.inboxPath, "a") else: #print("looks like spam") ! destFile = open(self.spamBoxPath, "a") ! fcntl.lockf(destFile.fileno(), fcntl.LOCK_EX); while 1: oneLine = tempFile.readline() --- 381,399 ---- #sys.stdin.seek(0) #tempFile = sys.stdin ! if (final_odds < .90): #print("looks clean") ! if (self.inboxPath == "-"): ! destFile = sys.stdout ! else: ! destFile = open(self.inboxPath, "a") ! fcntl.lockf(destFile.fileno(), fcntl.LOCK_EX); else: #print("looks like spam") ! if (self.spamBoxPath == "-"): ! destFile = sys.stdout ! self.exitStatus = 1 ! else: ! destFile = open(self.spamBoxPath, "a") ! fcntl.lockf(destFile.fileno(), fcntl.LOCK_EX); while 1: oneLine = tempFile.readline() *************** *** 398,402 **** tempFile.close() os.remove(tempFileName) ! destFile.close() #print(interest_dict.keys().sort()) #print(interest_dict) --- 403,408 ---- tempFile.close() os.remove(tempFileName) ! if (destFile != sys.stdout): ! 
destFile.close() #print(interest_dict.keys().sort()) #print(interest_dict) *************** *** 570,573 **** --- 576,587 ---- ################################################## + def exitstatus(self): + """ + Looks like this is not needed, but I'm keeping it for now... + """ + #print("Over and %d" % self.exitStatus) + + ################################################## + def scanMessageFile(self, messageFile): """ *************** *** 637,640 **** --- 651,655 ---- boxScanner = TheSpamSecretary() + sys.exit(boxScanner.exitStatus) ## EOF ## |
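The routing change in the diff above can be summarized in a small standalone sketch. It is only an illustration of the logic visible in revision 1.10, not the program's actual structure: the function name choose_destination and the constant SPAM_THRESHOLD are made up here, and "-" stands for standard output as in the log message.

```python
SPAM_THRESHOLD = 0.90  # cutoff that appears in the revision 1.10 hunk


def choose_destination(final_odds, inbox_path, spam_box_path):
    """Return (destination, exit_status) for one scored message.

    Hypothetical helper mirroring the diff: a destination of "-" means
    standard output, and the exit status is only set to 1 when spam is
    written to standard output.
    """
    if final_odds < SPAM_THRESHOLD:
        # Looks clean: deliver to the inbox (or stdout if it is "-").
        return inbox_path, 0
    if spam_box_path == "-":
        # Spam going to stdout: flag it through the exit status so the
        # calling MDA (procmail and friends) can route it.
        return "-", 1
    # Spam going to a regular mbox file: exit status stays 0.
    return spam_box_path, 0
```

With both paths set to "-", a caller would see the message on stdout either way and would have to rely on the exit status alone to tell spam from non-spam.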
|
From: <kw...@us...> - 2003-01-07 04:54:44
|
Update of /cvsroot/tess/homepage In directory sc8-pr-cvs1:/tmp/cvs-serv1706/homepage Modified Files: index.html Log Message: More homepage stuff Index: index.html =================================================================== RCS file: /cvsroot/tess/homepage/index.html,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** index.html 6 Jan 2003 02:14:46 -0000 1.3 --- index.html 7 Jan 2003 04:54:41 -0000 1.4 *************** *** 24,27 **** --- 24,29 ---- <li><a href="#well">How well does it work?</a> <li><a href="#use">How do I use it?</a> + <li><a href="#what">What's it do?</a> + <li><a href="#why">Why another program?</a> <li><a href="http://sourceforge.net/projects/tess">SourceForge page</a> </ul> *************** *** 38,41 **** --- 40,47 ---- <p> <a name="use">The program is really rather flexible, and can be used in several different ways. The only real requirements are that you need to have python (I use 2.2) and you have to be able to invoke a python script to deliver mail. That way the script can put the incoming message into one of two boxes: either "regular incoming", or "spam incoming". After that, the options vary a lot depending on how YOU read and handle your mail. What I do is save all messages that are verified spam (I've checked to make sure the messages in the spam box are really spam, or I move spam that was not correctly identified) to a "VerifiedSpam" mailbox. Regular email that I'm done with, I just delete, which moves it to the "Deleted" mailbox. Then, every time I get a new message, the contents of "VerifiedSpam" and "Deleted" are consumed (parsed and deleted) by TeSS and added to the appropriate word database. All of this is handled in the single command in the .forward file.</a> + <p> + <a name="details">The fine details... Assuming my configuration as described <a href="#use">above</a>: Incoming mail is [mime decoded and then] broken into tokens - each token is (lowercased) a-z, 0-9, "-", and "'". 
Each token must have at least one letter in it. Each token may be at most 20 characters. These tokens are then looked up as described in <a href="http://www.paulgraham.com/spam.html" target=_blank>Paul Graham's paper</a>. If it is spam (> 80% probable), it is put in the spam box. If it is not, it is put in the Inbox. All messages in the Deleted box are parsed the same way, and each token is added to the "good words" file (some db file according to your platform - dbm, gdbm, whatever python decides). Then that mail file is truncated to 0 length. The same thing happens to the messages in the VerifiedSpam mailbox. That's all there is to it... I check my spam box every once in a while to make sure no good messages got dumped, and then I move them all to the VerifiedSpam box to be consumed. Regular email I delete as usual (which my mail program moves to a Deleted box). Eventually I may decide to dump Spam directly to VerifiedSpam without looking at it.</a> + <p> + <a name="why">Why did I write this when there are so many other anti-spam programs out there? There were 2 compelling reasons: I use multiple clients, so I needed a server-based solution; I don't like or want to mess with qmail or procmail, which virtually all other server solutions seem to want to use.</a> <p>   <!-- add a spacer --> |
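The token rules described in the "details" paragraph above can be sketched roughly as follows. This is not the code from TheSpamSecretary.py; the names tokenize, TOKEN_RE and MAX_TOKEN_LENGTH are invented for the example, and it assumes that runs longer than 20 characters are skipped rather than truncated, which the homepage text does not say either way.

```python
import re

# Runs of a-z, 0-9, "-" and "'" (applied after lowercasing the text).
TOKEN_RE = re.compile(r"[a-z0-9'-]+")
MAX_TOKEN_LENGTH = 20


def tokenize(text):
    """Split already mime-decoded text into tokens per the stated rules."""
    tokens = []
    for candidate in TOKEN_RE.findall(text.lower()):
        if len(candidate) > MAX_TOKEN_LENGTH:
            continue  # assumption: over-long runs are dropped, not truncated
        if not re.search(r"[a-z]", candidate):
            continue  # every token needs at least one letter
        tokens.append(candidate)
    return tokens
```

Under these assumptions, purely numeric runs such as prices or dates never reach the word database, while mixed tokens like "v1agra" do.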
|
From: <kw...@us...> - 2003-01-06 02:14:49
|
Update of /cvsroot/tess/homepage
In directory sc8-pr-cvs1:/tmp/cvs-serv32177
Modified Files:
index.html
Log Message:
Added comment in html file mentioning this file's location
Index: index.html
===================================================================
RCS file: /cvsroot/tess/homepage/index.html,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** index.html 6 Jan 2003 02:13:00 -0000 1.2
--- index.html 6 Jan 2003 02:14:46 -0000 1.3
***************
*** 1,3 ****
--- 1,4 ----
<html>
+ <!-- This file is located in the CVS repository in the "homepage" project. -->
<head>
<title> |
|
From: <kw...@us...> - 2003-01-06 02:13:02
|
Update of /cvsroot/tess/homepage In directory sc8-pr-cvs1:/tmp/cvs-serv31747 Modified Files: index.html Added Files: README Log Message: Massive update and added README --- NEW FILE: README --- The htdoc root is at shell.sourceforge.net:/home/groups/t/te/tess/htdocs Index: index.html =================================================================== RCS file: /cvsroot/tess/homepage/index.html,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** index.html 5 Jan 2003 21:36:00 -0000 1.1.1.1 --- index.html 6 Jan 2003 02:13:00 -0000 1.2 *************** *** 1,13 **** <html> ! <head> ! <title>The Spam Secretary</title> ! </head> ! <body> ! <h1>The Spam Secretary</h1> ! There is nothing here yet.<br> ! Please look <a href="http://www.sourceforge.com/projects/tess">here.</a> ! <address> ! <a href="mailto:kw...@po...">Kurt Werle</a> ! </address> ! </body> </html> --- 1,75 ---- <html> ! <head> ! <title> ! The Spam Secretary ! </title> ! </head> ! <body> ! <table width=100%> ! <tr> ! <td> ! </td> ! <td align=center> ! <font size=+2>The Spam Secretary</font> ! </td> ! <td> ! </td> ! </tr> ! <tr> ! <td valign=top align=left> ! <ul> ! <li><a href="#overview">Overview</a> ! <li><a href="#works">How Does it Work?</a> ! <li><a href="#well">How well does it work?</a> ! <li><a href="#use">How do I use it?</a> ! <li><a href="http://sourceforge.net/projects/tess">SourceForge page</a> ! </ul> ! </td> ! <td> ! <p> ! <a name="overview"> The Spam Secretary is an anti-spam mail filter based on the article at <a href="http://www.paulgraham.com/spam.html" target=_blank>http://www.paulgraham.com/spam.html</a>. It is a single file written in python and is easy to install. It is mailclient independent, and the hope is that it will be MDA independent as well. As long as you can specify a program to run when you deliver mail, have python installed, and your mail is stored in mbox format (maildir format forthcoming), you're all set.</a> ! <p> ! 
Please look <a href="http://www.sourceforge.com/projects/tess">here for files and documentation.</a> ! <p> ! <a name="works">In brief, the filter works by keeping track of how often a word appears in spam mail vs. non-spam mail. If a word often appears in spam, but only seldom appears in non-spam, the odds of that message being spam are greater. In this way, the 15 "most interesting" (most and least spam-probable) words are found in an incoming message and the message is sorted accordingly. This sounds so obvious and easy that it's hard to believe it really works, but it does.</a> ! <p> ! <a name="well">How well does it work? Once you have built up a reasonable database of words, I've seen it be at least 95% effective. What's more, I've NEVER had it assign a "false positive" - guessing something is spam that is not.</a> ! <p> ! <a name="use">The program is really rather flexible, and can be used in several different ways. The only real requirements are that you need to have python (I use 2.2) and you have to be able to invoke a python script to deliver mail. That way the script can put the incoming message into one of two boxes: either "regular incoming", or "spam incoming". After that, the options vary a lot depending on how YOU read and handle your mail. What I do is save all messages that are verified spam (I've checked to make sure the messages in the spam box are really spam, or I move spam that was not correctly identified) to a "VerifiedSpam" mailbox. Regular email that I'm done with, I just delete, which moves it to the "Deleted" mailbox. Then, every time I get a new message, the contents of "VerifiedSpam" and "Deleted" are consumed (parsed and deleted) by TeSS and added to the appropriate word database. All of this is handled in the single command in the .forward file.</a> ! <p> !   <!-- add a spacer --> ! </td> ! <td> ! </td> ! </tr> ! <tr> ! <td> ! </td> ! <td> ! <table> ! <tr> ! <td> ! TESS is ! </td> ! <td> ! 
<a HREF="http://www.python.org/"> <img SRC="http://starship.python.net/crew/just/pythonpowered/PythonPowered.gif" ALIGN=top WIDTH=110 HEIGHT=44 ALT="Python Powered" BORDER=0></a> ! </td> ! <td> ! Hosted by ! </td> ! <td> ! <A href="http://sourceforge.net/projects/tess"> <IMG src="http://sourceforge.net/sflogo.php?group_id=62844&type=5" width="210" height="62" border="0" alt="SourceForge.net Logo"></A> ! </td> ! </tr> ! </table> ! </td> ! <td valign=bottom align=right> ! Written by ! <address> ! <a href="mailto:kw...@po...">Kurt Werle</a> ! </address> ! </td> ! </tr> ! </table> ! </body> </html> |
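The "15 most interesting words" idea from the homepage text above comes from Paul Graham's article, and the combining step can be sketched as below. This follows the formula published in that article and is not taken from TheSpamSecretary.py, which may differ in detail; the function name spam_probability and its parameters are made up for the illustration, and per-token spam probabilities are assumed to have been looked up already.

```python
def spam_probability(token_probabilities, most_interesting=15):
    """Combine per-token spam probabilities into one message score.

    Illustrative only: picks the tokens whose probability is farthest
    from the neutral 0.5 and combines them as in Graham's article.
    """
    # Clamp to [0.01, 0.99], as Graham does, so no single token can
    # force the result to exactly 0 or 1 (or cause a division by zero).
    clamped = [min(0.99, max(0.01, p)) for p in token_probabilities]
    # "Most interesting" means most or least spam-probable.
    ranked = sorted(clamped, key=lambda p: abs(p - 0.5), reverse=True)
    spam_product = 1.0
    ham_product = 1.0
    for p in ranked[:most_interesting]:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)
```

The result would then be compared against whatever cutoff the filter uses to decide between the inbox and the spam box.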