#108 Avoid duplicates in SPAM folder

open
nobody
None
5
2003-12-27
2003-12-27
No

Here is one possible way to optimise the SPAM filter, specifically how SPAM emails are stored in the SPAM folder: avoid storing duplicate emails.

Quite often, the SPAM folder will hold many duplicate messages because they come from the same sources. Furthermore, the content from these senders may change little or not at all over a period of time. Repeatedly isolating these messages means storing essentially duplicate messages in the SPAM folder. For the purposes of regenerating the spam dictionary, the duplicates serve no useful purpose as long as a count of how many times those spam words have been received is maintained.

My idea, therefore, is to keep only one copy of duplicate emails in the SPAM folder. To avoid a potentially computationally intensive duplicate-removal algorithm, this can be achieved by generating a word-based checksum from the content of each SPAM email. Emails with duplicate content will generate the same checksum, which allows a) the duplicate to be isolated and deleted instead of being stored in the SPAM folder, and b) the offending email to still be used to bump up the SPAM word counts in the SPAM dictionary.
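
To make the idea concrete, here is a rough Python sketch of how it might work. None of this is taken from the existing code: the names `tokenize`, `word_checksum` and `SpamStore` are made up for illustration, and SHA-1 over the lower-cased word list is just one assumed way of building a "word-based checksum". Note that it only catches emails whose word sequence is identical; messages that differ by even one word would still be stored separately.

```python
import hashlib
import re
from collections import Counter


def tokenize(body):
    """Split a message body into lower-cased words (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9']+", body.lower())


def word_checksum(words):
    """Checksum over the word sequence, so case and spacing do not matter."""
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


class SpamStore:
    """Hypothetical stand-in for the SPAM folder plus the SPAM dictionary."""

    def __init__(self):
        self.messages = {}            # checksum -> the single stored copy
        self.word_counts = Counter()  # SPAM word counts for the dictionary

    def add(self, body):
        """Count the words of every SPAM email, but store only new content.

        Returns True if the email was stored, False if it was a duplicate.
        """
        words = tokenize(body)
        # b) the email always bumps the word counts, duplicate or not
        self.word_counts.update(words)
        digest = word_checksum(words)
        if digest in self.messages:
            # a) duplicate content: do not store a second copy
            return False
        self.messages[digest] = body
        return True


store = SpamStore()
first_stored = store.add("Buy CHEAP pills now!")
second_stored = store.add("buy   cheap pills\nNOW")
```

In this sketch the second email has the same words as the first, so it is not stored again, but its words are still counted.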

Obviously you may be able to come up with a better way of achieving the same goal.

Kulwant
