As a proxy, ASSP passes through most of your host mail transport’s security features and vulnerabilities. It is also a running service that accepts connections from the public Internet. Perl in general has a good track record of offering few vulnerabilities. As a proxy, ASSP’s only input/output is socket based, which limits its exposure. ASSP never opens files with user-inputted names and never shells to the operating system. In a *nix environment you will want to use ASSP’s ability to run as a non-root user. You may also consider running it in a chroot jail. To do this, set the ChangeRoot variable in the configuration to your ASSP directory and copy (or link) the /etc/protocols file into an etc/protocols file in the ASSP directory.
The collections of spam and non-spam email may represent a security risk, and access should be restricted to mail administrators. The non-spam email collection will certainly contain sensitive correspondence, and steps should be taken to protect it from those who don’t require access.
Your administration password is transmitted with basic authentication (i.e. no encryption). If you plan to use the web interface from a host where you feel sniffing is a possibility I’d recommend installing stunnel (www.stunnel.org) to create an encrypted tunnel for your web-admin sessions. The password is stored in plain text in the assp.cfg file -- make sure file permissions protect this file from read access by unauthorized users. You can also add IP addresses to the Allow Admin Connections From configuration entry to restrict access to the admin interface, although this type of packet is quite easy to spoof.
ASSP uses a sophisticated parsing filter to work around most spammer tricks to disguise their content. As content-based filters like ASSP become more common spammers may find ways to better disguise their message. I personally do not believe spammers will win that battle, but it’s hard to say for sure.
If everyone we email gets added to the ASSP whitelist, won’t spammers just use an address from the whitelist to spam us? It is possible, but more difficult than it sounds. Addresses from your local site aren’t added to the whitelist, so a spammer will have to find someone your site emails. That list will be different for every site using ASSP. A better strategy would be for the spammer to trick you into emailing him/her. But that too will only work for one site at a time. Ultimately it is possible for the spammer to use this strategy to spam your site, but she/he will have to do the same thing individually for every site running ASSP. If this becomes a problem we will develop an appropriate defense.
ASSP has been designed with great care to prevent this from happening. The whitelist is the single most powerful tool to prevent this – anyone you email will never have a message blocked. The spam filter keeps track of mail we send and spam we receive -- if an incoming message is not from someone we've emailed and it's more like the mail we send than the spam we receive then it gets through. Otherwise it's blocked and the sender gets the message, "Mail appears to be unsolicited -- report errors to postmaster@ourhost.com." The type of email that most often falls in this category is confirmation emails from web sites. Often these mails are only as personal as your email address and contain a lot of advertising – they look a lot more like spam than they look like the mail you send. If someone has a good idea how to recognize this type of email please let me know.
Now that ASSP supports the "Expression to recognize non-spam" you can use that to help recognize these confirmation emails. Often they'll include your address, phone number, or other personal information that spam never includes. You can build a "regular expression" to recognize some of these.
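As a rough illustration of such an expression, here is a minimal sketch in Python. The phone number and street address are made-up placeholders, and the function name is hypothetical. ASSP itself is written in Perl, but a simple alternation like this one is written the same way in both regex dialects.

```python
import re

# Hypothetical personal details (placeholders, not real data). Replace them
# with details that genuine confirmation emails to your users would contain.
NON_SPAM_RE = re.compile(r"555[-. ]?0142|221B\s+Baker\s+Street", re.IGNORECASE)


def looks_like_confirmation(message_text: str) -> bool:
    """Return True when the message mentions any of the personal details."""
    return NON_SPAM_RE.search(message_text) is not None
```

Only the pattern itself would go into the "Expression to recognize non-spam" setting; the surrounding function is just a way to try the pattern against sample messages first.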
At this point ASSP looks for words built from A-Z and the symbols from \240-\377 and separated by spaces. (It’s a little more complicated than that, but that’s basically it.) If your language is mostly that way then ASSP will work fine – Spanish, French, German, Polish, etc., primarily use the Latin alphabet and should work fine. Korean, Japanese, and Chinese don’t work well. Future plans may include improvements to make them more functional. We have active users working in Spanish, French, and German without problems.
One message per file. Only the first 10k bytes are significant. Keep attachments attached – ASSP parses them up to the first 10k. Separate collections are kept in separate folders. Largely whitespace and headers (except the subject) are ignored. Edit, delete, or add files and rebuild the database – that’s about all there is to it. Files that have numbers as filenames will randomly be overwritten over time keeping the collection up-to-date and limited in size. As of version 0.3.4 ASSP also began to track helo phrases passed in the SMTP conversation -- see the format of the ASSP received header line to see how this should be formatted.
ASSP's CPU and memory load are quite moderate. Excluding rebuilding the databases, ASSP uses fewer CPU cycles per message than our mail transport does and significantly fewer per message than our virus filter software.
Beyond the Spam Lovers and Redlist, per-user settings are beyond the scope of ASSP’s design goals. They’re generally pretty hard to implement in the SMTP Proxy environment.
No. The rebuildspamdb script can run without stopping ASSP.
Here's a matrix to help identify the differences:
[ filtered mail | unfiltered mail ] x [ contributes to whitelist | doesn't contribute ] =
filtered & contributes = normal
unfiltered & contributes = spamlover
filtered & doesn't contribute = redlist (does contribute to spam/nonspam collections)
unfiltered & doesn't contribute = no processing (also doesn't contribute to spam/nonspam collections)
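The same matrix can be written as a small lookup table. This is only a sketch of the classification. The function and variable names are hypothetical and do not come from ASSP's source.

```python
# (filtered, contributes_to_whitelist) -> address category, as in the matrix.
CATEGORIES = {
    (True, True): "normal",
    (False, True): "spamlover",
    (True, False): "redlist",
    (False, False): "no processing",
}


def address_category(filtered: bool, contributes_to_whitelist: bool) -> str:
    """Look up the category for one combination of the two properties."""
    return CATEGORIES[(filtered, contributes_to_whitelist)]
```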
When a mail client connects to a mail server to send mail it must send an SMTP command, "HELO" (or the variant EHLO), followed by what it calls itself. Almost every server uses its host name in this greeting: m11.lax.untd.com for example. However, spammers often greet with a random string of letters: slk845gjlkas perhaps. ASSP tries to recognize these greetings because they're an excellent indicator of spaminess. Unfortunately, a bug in versions prior to 0.3.5 meant that all messages without a header are interpreted as randomhelo greetings (or rndhelo).
When you install ASSP a colony of super-intelligent thermophilus bacteria takes up residence on your CPU and begins reading all your email. They communicate using radio waves directly with the CPU and interface with the ASSP software, choosing between spam and nonspam mail. If you choose to read further this myth will be sadly dispelled, and we take no responsibility for the consequences. However, you can always refer your clients to this page to prove to them that their email is actually being filtered by super-intelligent bacteria.
The rebuildspamdb program is where I will start. It reads the files in your errors/spam, errors/notspam, spam and notspam directories. As it reads the files in the errors directories it also builds a hash of the mail body so that it can identify duplicate messages that were misfiled. This hash is used to delete messages from the notspam collection that were also in the errors/spam collection and from the spam collection that were also in the errors/notspam collection. Think of it like scrubbing bubbles – they do the work so you don’t have toooo!
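The duplicate scrubbing can be sketched roughly as follows. This is an illustration in Python, not ASSP's Perl code. The use of SHA-256, the in-memory dictionaries (file name to message text), and the names body_hash and scrub are all assumptions made for the example.

```python
import hashlib


def body_hash(message: str) -> str:
    """Hash only the body, i.e. the text after the first blank line."""
    _head, separator, body = message.partition("\n\n")
    if not separator:
        # No header/body separator found: fall back to the whole message.
        body = message
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def scrub(collection: dict, errors: dict) -> dict:
    """Drop messages whose body also appears in the opposite errors folder."""
    misfiled = {body_hash(text) for text in errors.values()}
    return {
        name: text
        for name, text in collection.items()
        if body_hash(text) not in misfiled
    }
```

Under these assumptions, the notspam collection would be scrubbed against errors/spam and the spam collection against errors/notspam.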
As rebuildspamdb reads the files it also does two things. First it runs a filter (the subroutine “clean”) that prepares the message for statistical analysis. Second it walks through the file tallying word pairs in the spam or not-spam categories according to the collection. Files in the errors/spam collection count double; files in the errors/notspam collection count x4.
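A minimal Python sketch of the weighted tally is shown below. It assumes the x2 weight applies to errors/spam and the x4 weight to errors/notspam, and it assumes each message has already been cleaned and split into a list of words. The names WEIGHTS and tally are hypothetical.

```python
from collections import defaultdict

# collection name -> (counts as spam?, weight); the weights are assumptions
# read from the description above.
WEIGHTS = {
    "spam": (True, 1),
    "notspam": (False, 1),
    "errors/spam": (True, 2),
    "errors/notspam": (False, 4),
}


def tally(collections: dict) -> dict:
    """Return {word pair: [spam count, total count]} over all collections."""
    counts = defaultdict(lambda: [0, 0])
    for name, messages in collections.items():
        is_spam, weight = WEIGHTS[name]
        for words in messages:
            # Adjacent words form the pairs that are counted.
            for first, second in zip(words, words[1:]):
                entry = counts[f"{first} {second}"]
                if is_spam:
                    entry[0] += weight
                entry[1] += weight
    return dict(counts)
```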
The “clean” subroutine does a number of important operations. Primarily its function is to undo the things spammers do to trick filters. It cleans up base64 encoding. It cleans up many HTML obfuscation techniques. Look at the code of the “sub clean” for more details – it’s all commented. It also does two other things (and may do more in the future) to help the Bayesian analysis. First, it inserts a keyword after each word of the subject – this lets the Bayesian filter recognize words in the subject uniquely. For example the word “free” in the subject will have a different Bayesian rating than the word “free” in the body of the message. Second it does a couple of tricks to isolate the “HELO” greeting that was sent when the message was delivered. This has also proven to be a useful Bayesian factor in identifying spam.
Paul Graham’s “A Plan for Spam” recommends complete header analysis within the Bayesian filter. Because ASSP initially used three-keyword identifiers, and now two-keyword identifiers, I found this useless. However, header analysis will be a fruitful area of development for improving ASSP’s spam / ham recognition rate in the future. That will take place in the “clean” subroutine. There may be other pre-processing features that will be introduced there in the future.
Once each mail message is pre-processed (cleaned), each word pair is tallied. Words are defined as [-\$A-Za-z0-9\'\.!\240-\377]+ – words shorter than 2 or longer than 19 characters are ignored, and words are further cleaned in this way: s/[,.']+$//; s/!!!+/!!/g; s/--+/-/g; [Sorry for the technical stuff for those allergic to it.] In the end you have a big database of word pairs and their counts: “in the”: spam=23210, total=46411; “order now”: spam=20001, total=20121. The rebuildspamdb program then steps through this database, discarding identifiers with a total less than 5 (i.e. if a word pair occurred 4 or fewer times in all the collections combined, with errors/spam counted x2 and errors/notspam counted x4, then the pair can be ignored) and calculating the spaminess ratio this way:
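For those not allergic to the technical stuff, the word definition and clean-up rules above translate into Python roughly as follows. This is a sketch, not ASSP's Perl code. It assumes the length limits are applied after the clean-up substitutions and that case is left untouched, neither of which is stated above. The function names are hypothetical.

```python
import re

# Same character class as the Perl expression; \240-\377 is \xa0-\xff.
WORD_RE = re.compile(r"[-$A-Za-z0-9'.!\xa0-\xff]+")


def clean_word(word: str) -> str:
    """Apply the three substitutions listed above, in the same order."""
    word = re.sub(r"[,.']+$", "", word)  # strip trailing punctuation
    word = re.sub(r"!!!+", "!!", word)   # collapse runs of 3+ '!' to '!!'
    word = re.sub(r"--+", "-", word)     # collapse runs of '-' to one
    return word


def words(text: str) -> list:
    """Extract words, keeping those of 2 to 19 characters after clean-up."""
    cleaned = (clean_word(raw) for raw in WORD_RE.findall(text))
    return [word for word in cleaned if 2 <= len(word) <= 19]


def word_pairs(text: str) -> list:
    """Join each word with its successor to form the tallied identifiers."""
    found = words(text)
    return [f"{first} {second}" for first, second in zip(found, found[1:])]
```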
If the spam count = 0 or the spam count = the total count then square both counts. (This amplifies factors which appear only in the spam or not-spam collection.)
Spaminess = (spam count + 1) / (total count + 2) (This should look familiar to anyone with a basic understanding of Bayesian filters. It also somewhat de-emphasizes rare identifiers and emphasizes common ones.)
Throw out the identifier if it’s between 0.41 and 0.59 – this identifier appears almost equally in both spam and non-spam, so there’s no point in keeping it.
Force the result between 0.999999 and 0.000001 – Bayesian classifiers croak if the value is too close to 0 or 1.
All of these results are sorted (by identifier) and stored in the spamdb for use by ASSP.
Rebuildspamdb also randomly (1 time in 20) prunes outdated entries in the whitelist and goodhosts databases.
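The spaminess steps above can be sketched in Python as follows. This is an illustration rather than the rebuildspamdb code itself. The function names are hypothetical, and treating the 0.41 and 0.59 bounds as exclusive is an assumption.

```python
def spaminess(spam_count: int, total_count: int):
    """Return the spaminess of one identifier, or None if it is discarded."""
    if total_count < 5:
        return None  # too rare to be useful
    if spam_count == 0 or spam_count == total_count:
        # Amplify identifiers seen only in spam or only in non-spam.
        spam_count, total_count = spam_count ** 2, total_count ** 2
    ratio = (spam_count + 1) / (total_count + 2)
    if 0.41 < ratio < 0.59:
        return None  # nearly neutral, not worth keeping
    # Keep the value away from exactly 0 or 1.
    return min(0.999999, max(0.000001, ratio))


def build_spamdb(counts: dict) -> list:
    """Turn {identifier: (spam, total)} into sorted (identifier, value) rows."""
    rows = []
    for identifier, (spam_count, total_count) in counts.items():
        value = spaminess(spam_count, total_count)
        if value is not None:
            rows.append((identifier, value))
    return sorted(rows)
```

With the counts quoted earlier, “in the” (spam=23210, total=46411) comes out near 0.5 and is thrown away, while “order now” (spam=20001, total=20121) is kept with a value above 0.99.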
Now you know how the spamdb is built, so let’s see how it is used.
Suppose a mailer on the Internet connects to ASSP. ASSP makes a connection to your “SMTP Destination” and begins relaying their conversation. It notes the IP address of the connecting server. It notes their HELO string. It notes their MAIL FROM (envelope sender). It notes their RCPT TOs. It notes their DATA directive. (This is all in sub “getline”.) Relay attempts are blocked. The presence of spam bucket addresses is noted. Mail to the email interface is detected. Mail to no-processing or “spam lover” addresses is noted. Assuming none of that qualifies, the message is passed on to “getheader.”
Getheader is looking for the mail header. When the header is complete getheader calls “onwhitelist” which determines if the message should be treated as whitelisted/local (it’s the same really) and if so to update the whitelist. If not processing goes on to “getbody.”
Getbody reads the rest of the message (or the first 10000 bytes including the header, whichever comes first), checks for attached executables (if that’s enabled) and calls “isspam” which is probably why you’re reading this document.
The isspam subroutine first checks WhiteRe and BlackRE, the expressions to identify non-spam and spam, respectively. Then it calls “clean” to clean up any spammer obfuscation, and checks both expressions again against the “cleaned” version. Then it checks for a DNSBL hit, which adds 0.97 twice to the list of Bayesian factors for this message. Then it checks for a goodhost miss, which adds whatever your site’s goodhost factor is twice, provided it is > 0.65. Then it walks through the message’s word pairs, just like rebuildspamdb did, completing the list of Bayesian factors. Unlike rebuildspamdb, an identifier hit will only be counted a maximum of two times, so if the identifier “free money” rates 0.955 and “free money” occurs three or more times in the mail message, only the first two count.
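The way the factor list is assembled can be sketched in Python as below. This is a simplified illustration and not the isspam code. The function name, the parameter names, and the idea of passing the DNSBL and goodhost results in as flags are all assumptions made for the example.

```python
def collect_factors(pairs, spamdb, dnsbl_hit=False,
                    goodhost_miss=False, goodhost_factor=0.0):
    """Build the list of Bayesian factors for one message."""
    factors = []
    if dnsbl_hit:
        factors += [0.97, 0.97]  # a DNSBL hit is added twice
    if goodhost_miss and goodhost_factor > 0.65:
        factors += [goodhost_factor, goodhost_factor]
    hits = {}
    for pair in pairs:
        if pair not in spamdb:
            continue  # identifiers missing from the spamdb add nothing
        hits[pair] = hits.get(pair, 0) + 1
        if hits[pair] <= 2:  # each identifier counts at most twice
            factors.append(spamdb[pair])
    return factors
```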
The list of factors is sorted and the thirty factors closest to 0 or 1 (i.e. the 30 furthest from 0.5 or neutral) are combined as Bayes taught into a single probability. If this probability is greater than 0.6 the message is spam. (Mail is very rarely between 0.2 and 0.8 – it’s almost always > 0.9 or