From: Tony M. <to...@sp...> - 2009-09-07 22:40:39
Hi Alpana,

> What types of emails should be whitelisted?

IMO:

* 'Bugs': ham where the hash of the message is identical to the hash
  of another (unrelated) message, either because of a flaw in the code
  (e.g. the one mentioned in ticket #57 and in this list at various
  times), or because the digest algorithm isn't selecting sufficiently
  unique text to hash. This is the only case where single-recipient
  messages would be worth reporting.

* 'False positives': ham that has already been processed with Pyzor,
  which isn't sent to a single recipient, and had a non-zero hit count.
  IOW someone has incorrectly reported the message as spam (it's
  unfortunately common for people to hit the 'spam' button for
  automated messages that they subscribed to but are no longer
  interested in; this could also occur if someone is automatically
  feeding messages to Pyzor; it could also be a true mistake). If you
  are the only recipient, then there isn't really any value in
  whitelisting, but if there is even one other, then it's possible that
  you'll whitelist before their message is checked, and you'll save
  them from the false positive.

An average user is not going to be able to distinguish between these
two, but they are distinct to me as (a) a Pyzor developer, and (b)
someone using Pyzor to filter many users' mail.

* 'My mail': this is controversial, and no-one is doing this on
  public.pyzor.org at the moment (although people might be on their own
  servers). If you're sending out messages to multiple recipients, then
  you could whitelist the message before you actually send it (this
  could be done automatically by your local SMTP server). Since headers
  aren't used in the hash, the hash would be the same pre-sending as on
  arrival. You'd then be ensuring that Pyzor didn't flag your messages
  as spam. Of course, if the spammers did this, it would not be good!
  (To avoid that, whitelisting typically requires an account on the
  Pyzor server.)

* 'True negatives': i.e. you have some other method(s) of determining
  if a multiple-recipient message is ham/spam, and if it comes out as
  ham, then you whitelist it with Pyzor, either without checking Pyzor,
  or even when Pyzor does not consider it spam. I'm not certain this is
  a good idea - it increases the resources required for the Pyzor
  server, and you're saying that you really trust these other
  classification methods (because they'll override any Pyzor result).
  However, if you assume that the other classification never has false
  negatives, and you're willing to commit the additional server
  resources, then this would avoid false positives (from bugs or user
  error) for anyone classifying their mail after you do.

Something that you could look at (if you get suitable corpora of email)
is the time distribution of identically-hashed messages (ignoring any
caused by bugs). If identical-hash messages generally arrive within a
short period of time (e.g. several hours), then the value of manual
whitelisting based on user feedback is diminished (because it is likely
that by the time it is done all the messages have been classified
already). In that case, only automated whitelisting would have much
value. However, if they arrive over a long period of time (e.g. batches
occurring over a week), then there is value in a manual whitelist
operation.

Cheers,
Tony
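
A rough sketch of how the time distribution of identically-hashed messages could be measured. It is not part of Pyzor: the function names are hypothetical, and it assumes you already have (digest, arrival time) pairs extracted from a corpus, with bug-induced collisions removed beforehand.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def arrival_spans(records):
    """Return {digest: last arrival - first arrival} for repeated digests.

    `records` is an iterable of (digest, arrival_time) pairs. How the
    digest is computed is outside this sketch (assumption: it comes from
    whatever digest your Pyzor setup produces for each message).
    """
    times = defaultdict(list)
    for digest, arrived in records:
        times[digest].append(arrived)
    # A digest seen only once says nothing about how copies are spread out.
    return {d: max(ts) - min(ts) for d, ts in times.items() if len(ts) > 1}


def fraction_within(spans, window):
    """Fraction of repeated digests whose copies all arrived within `window`."""
    if not spans:
        return 0.0
    return sum(1 for span in spans.values() if span <= window) / len(spans)


# Made-up example data: "aaa" arrives within hours, "bbb" over several days.
records = [
    ("aaa", datetime(2009, 9, 1, 8, 0)),
    ("aaa", datetime(2009, 9, 1, 11, 30)),
    ("bbb", datetime(2009, 9, 1, 9, 0)),
    ("bbb", datetime(2009, 9, 6, 9, 0)),
    ("ccc", datetime(2009, 9, 2, 12, 0)),
]
spans = arrival_spans(records)
short_fraction = fraction_within(spans, timedelta(hours=6))  # 0.5 here
```

If most repeated digests fall inside a window of a few hours, that would point towards automated whitelisting being the only kind with much value; a large share of multi-day spans would suggest manual whitelisting can still help.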