#20 review cache logic

open
nobody
None
5
2010-08-04
2010-08-04
Cedric Knight
No

Caching SpamAssassin results keyed on an MD5 digest of the body obviously saves a huge amount of processing time. I've noticed, though, that the results from the cache are sometimes wrong because, as the source says, "spam level and spam report may be influenced by a mail header section too, not only by a mail body".

For instance, when spam first comes via a trusted IP or mailing list, SA's RBL lookups are not applied to subsequent copies of the spam (the cached result is reused), and so they get too low a score (admittedly, it could work the other way around). Also, the actual results of RBL or URIBL lookups may change over time (DNS entries are usually cached for an hour).

So ideally, if a matching body digest is found, the SA body checks would be skipped, but the "header" and "meta" rules, and if possible even the "uridnsbl" rules, would be re-evaluated. (Would this also need some additional hooks in the SA code?)
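To make the idea concrete, here is a rough sketch in Python (not the project's Perl, and not SpamAssassin's real API). It assumes the score can be split into a body part and a header part; `body_rules` and `header_rules` are hypothetical callables standing in for the expensive body checks and the cheaper header/DNS checks. Real "meta" rules combine both kinds, so a simple sum is a simplification.

```python
import hashlib


def classify(headers, body, cache, body_rules, header_rules):
    """Return a spam score, caching only the body-derived part.

    cache        -- dict mapping body digest -> cached body score
    body_rules   -- hypothetical callable: bytes -> float (expensive)
    header_rules -- hypothetical callable: dict -> float (re-run every time)
    """
    digest = hashlib.md5(body).hexdigest()
    if digest not in cache:
        # Cache miss: run the expensive body checks once per distinct body.
        cache[digest] = body_rules(body)
    # Header-dependent checks are evaluated for every message, so a copy
    # arriving from a listed relay is not scored like a trusted one.
    return cache[digest] + header_rules(headers)
```

With this split, two copies of the same body that arrive via different relays share the body score but can still get different totals.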

Another thing in the code that looks suboptimal: during a continuous spam flood, the status of the mail being received is never re-evaluated, because each cache hit sets the cache expiry (default) 600s into the future again. It seems better to recheck Razor and iXspam as well as the RBL and URIBL lookups every 10 minutes, at minimal CPU cost. That is, if ($spam_presence_checked), skip the $body_digest_cache->set. Similarly, it might be useful to have a mechanism to clear cached negative virus results as soon as virus signatures are updated.
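The expiry change can be illustrated with a toy cache, again in Python rather than the project's Perl. The class and function names here (`BodyDigestCache`, `check_mail`, `scan`) are made up for illustration. The point is only that a cache hit returns the stored result without calling `set` again, so the entry still expires 600s after it was first stored.

```python
import hashlib


class BodyDigestCache:
    """Toy cache keyed by body digest; entries keep their original expiry."""

    def __init__(self, ttl=600):
        self.ttl = ttl
        self._entries = {}  # digest -> (expires_at, result)

    def get(self, digest, now):
        entry = self._entries.get(digest)
        if entry is None:
            return None
        expires_at, result = entry
        if now >= expires_at:
            # Entry is stale: drop it so the caller rescans.
            del self._entries[digest]
            return None
        return result

    def set(self, digest, result, now):
        self._entries[digest] = (now + self.ttl, result)


def check_mail(cache, body, now, scan):
    """Return the scan result for body, rescanning once the entry expires."""
    digest = hashlib.md5(body).hexdigest()
    cached = cache.get(digest, now)
    if cached is not None:
        # Hit: deliberately do NOT call cache.set, so the expiry time
        # is not pushed further into the future by a continuing flood.
        return cached
    result = scan(body)
    cache.set(digest, result, now)
    return result
```

In this sketch, a flood of identical messages triggers one full scan per `ttl` window instead of a single scan for the whole flood.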

Discussion