ASSP Bayesian Filter
ASSP Bayesian Filter - Theory of Operation
Every message that passes through your SMTP server has a from address and one or more to addresses. Your SMTP server also knows whether the message is being sent from your local network (in which case relaying is allowed) or is coming from outside (in which case it must be delivered to a local address). Your local users do not send unsolicited email (right?), and the people they correspond with would send you only solicited email. In fact, the people they email are themselves unlikely to send UCE (unsolicited commercial email). By monitoring these addresses ASSP builds a web of trust: local users are trusted, and the addresses in the TO or CC fields of their messages are trusted. Any email from these people is considered not-spam without further checking. (Note that this is not a good strategy for virus containment, but it is a good strategy for UCE.)
Users of the local mail domains are not added to the whitelist; they are identified by being part of the local network. Many spammers forge a from address with the same domain as the to address, so it is important to avoid adding local addresses to the whitelist. Within only a few days of operation you should see your whitelist grow to more than 1000 addresses. The whitelist is helpful not only in identifying non-spam, but also in building your database of non-spam emails. The whitelist is automatically saved every $UpdateWhitelist seconds (1 hour by default).
Spambuckets are addresses which receive only spam. They can be embedded in your web site, posted on Usenet, or arise naturally when employees leave your organization: after a reasonable period of bouncing their mail, all mail received for these addresses can be considered unsolicited. Any email whose sender is not whitelisted and which is addressed to a spambucket is classified as spam. Spambuckets are helpful both in identifying spam and in building and maintaining your spam database.
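The whitelist and spambucket rules described above can be sketched as a short decision routine. This is only an illustration in Python, not ASSP's actual code: the names LOCAL_DOMAINS, SPAMBUCKETS, learn_from_outgoing and pre_classify, and the example addresses, are all hypothetical.

```python
# Hypothetical sketch of the pre-Bayesian checks; not ASSP's real implementation.
LOCAL_DOMAINS = {"example.com"}              # assumed local mail domain
SPAMBUCKETS = {"oldemployee@example.com"}    # assumed spambucket address


def is_local_address(address):
    """True if the address belongs to one of the local mail domains."""
    return address.rsplit("@", 1)[-1].lower() in LOCAL_DOMAINS


def learn_from_outgoing(whitelist, recipients):
    """Whitelist the TO/CC addresses of a locally sent message.

    Local addresses are skipped, because spammers often forge a from
    address in the recipient's own domain.
    """
    for addr in recipients:
        if not is_local_address(addr):
            whitelist.add(addr.lower())


def pre_classify(sender, recipients, from_local_network, whitelist):
    """Return 'not-spam', 'spam', or 'bayesian' (statistical check needed)."""
    if from_local_network:
        learn_from_outgoing(whitelist, recipients)
        return "not-spam"
    if sender.lower() in whitelist:
        return "not-spam"
    if any(r.lower() in SPAMBUCKETS for r in recipients):
        return "spam"
    return "bayesian"
```

Note that a message arriving from outside with a forged local from address is not trusted here: the local network check looks at where the connection comes from, not at the from address, and local addresses never enter the whitelist.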
Finally, if an email arrives that is not from someone on your local network, not from a whitelisted sender, and not addressed to a spambucket, it is compared to the statistical profile generated by the Bayesian filter. The Bayesian filter works by looking for words and phrases (up to three words long) that occur significantly more often in either your non-spam collection or your spam collection. For most organizations, spam identifiers include things like "get rich quick", while non-spam identifiers are things like your organization's full name or address, or the personal names of people who work there. They also include considerably more subtle signals, such as HTML tags which spammers prefer, or jargon specific to your line of business. To classify a new email, all the words and phrases in the first 10000 bytes of the email (including the header) are checked against the statistical model. The 50 top-ranking words and phrases are combined according to Bayes' theorem to predict whether the mail more closely resembles the spam or the non-spam in your collections.
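The classification step above can be illustrated with a minimal Python sketch. Several details are assumptions rather than ASSP's documented behavior: the whitespace tokenizer, ranking tokens by their distance from a neutral 0.5, clamping probabilities to the range 0.01 to 0.99, and the naive (independence-assuming) form of the Bayes combination. The function names and the token_probs table are hypothetical.

```python
import math
import re

MAX_BYTES = 10000   # only the first 10000 bytes are examined (from the text)
TOP_N = 50          # number of top-ranking tokens combined (from the text)


def phrases(text, max_len=3):
    """Yield every run of 1..max_len consecutive words, lowercased."""
    words = re.findall(r"\S+", text.lower())   # assumed tokenizer
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def spam_probability(message, token_probs, unknown=0.5):
    """Combine per-token spam probabilities into one score between 0 and 1.

    token_probs maps a word or phrase to an estimated P(spam | token).
    """
    head = message.encode("utf-8", "replace")[:MAX_BYTES].decode("utf-8", "ignore")
    seen = set(phrases(head))
    probs = [min(max(token_probs[t], 0.01), 0.99) for t in seen if t in token_probs]
    # Assumed ranking: the most decisive tokens are those farthest from 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:TOP_N]
    if not top:
        return unknown
    # Naive Bayes combination, computed in log space for numerical stability:
    # P = prod(p) / (prod(p) + prod(1 - p))
    log_spam = sum(math.log(p) for p in top)
    log_ham = sum(math.log(1.0 - p) for p in top)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))
```

With a hypothetical table such as {"get rich quick": 0.99, "acme widgets": 0.01}, a message containing the first phrase scores close to 1 and one containing only the second scores close to 0; a message with no known tokens returns the neutral value.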
After an email is classified as local or whitelisted, or as Bayesian spam or spam sent to a spambucket, its first 10000 bytes are saved in the appropriate collection directory. It is given a random number between 0 and MaxFiles (14000 by default) and written to a file with that name. In this way older files are gradually (and randomly) replaced with newer files, which keeps the collections both diverse and up to date. Files in the errors folders (correctedspam and correctednotspam) are never overwritten.
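The random-replacement scheme above can be sketched as follows. This is an illustrative Python version, not ASSP's code: the function name store_in_collection is hypothetical, and whether the upper bound MaxFiles is itself a valid file name is an assumption (here it is excluded).

```python
import os
import random

MAX_FILES = 14000   # default MaxFiles (from the text)
MAX_BYTES = 10000   # only the first 10000 bytes are kept (from the text)


def store_in_collection(raw_message, collection_dir, max_files=MAX_FILES, rng=random):
    """Write the head of raw_message (bytes) to a randomly numbered file.

    An existing file with the same number is overwritten, so older
    messages are gradually replaced by newer ones.
    """
    os.makedirs(collection_dir, exist_ok=True)
    slot = rng.randrange(max_files)            # 0 .. max_files - 1 (assumed range)
    path = os.path.join(collection_dir, str(slot))
    with open(path, "wb") as fh:
        fh.write(raw_message[:MAX_BYTES])
    return path
```

Because each new message lands in a uniformly random slot, the collection never grows beyond max_files files, and no single old message is guaranteed to be evicted first. The errors folders would simply never be passed to a routine like this.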