Re: what should be whitelisted?

Hi Alpana,

> What types of emails should be whitelisted?

IMO:

 * 'Bugs': ham where the hash of the message is identical to the hash
of another (unrelated) message, either because of a flaw in the code
(e.g. the one mentioned in ticket #57 and in this list at various
times), or because the digest algorithm isn't selecting sufficiently
unique text to hash.  This is the only case where single-recipient
messages would be worth reporting.

 * 'False positives': ham that was sent to more than one recipient, has
already been checked with Pyzor, and had a non-zero hit count.
IOW someone has incorrectly reported the message as spam (it's
unfortunately common for people to hit the 'spam' button for automated
messages that they subscribed to but are no longer interested in; this
could also occur if someone is automatically feeding messages to
Pyzor; it could also be a true mistake).  If you are the only
recipient, then there isn't really any value in whitelisting, but if
there is even one other, then it's possible that you'll whitelist
before their message is checked, and you'll save them from the false
positive.

An average user is not going to be able to distinguish between these
two, but they are distinct to me as (a) a Pyzor developer, and (b)
someone using Pyzor to filter many users' mail.

 * 'My mail': this is controversial, and no-one is doing this on
public.pyzor.org at the moment (but might be on their own servers).
If you're sending out messages to multiple recipients, then you could
whitelist the message before you actually send it (this could be done
automatically by your local SMTP server).  Since headers aren't used
in the hash, the hash would be the same pre-sending as on arrival.
You'd then be ensuring that Pyzor didn't flag your messages as spam.
Of course, if the spammers did this, it would not be good!  (Whitelisting
typically requires an account on the Pyzor server, to avoid that.)

 * 'True negatives': i.e. you have some other method(s) of determining
if a multiple-recipient message is ham/spam, and if it comes out as
ham, then you whitelist it with Pyzor, either without checking Pyzor, or
even when Pyzor does not consider it spam.  I'm not certain this is a
good idea - it increases the resources required for the Pyzor server,
and you're saying that you really trust these other classification
methods (because they'll override any Pyzor result).  However, if you
assume that the other classification never has false negatives, and
you're willing to commit the additional server resources, then this
would avoid false positives (from bugs or user error) for anyone
classifying their mail after you do.
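
To make the categories above a bit more concrete, here is a rough
Python sketch of the decision as I've described it.  None of these
names exist in Pyzor: Message, whitelist_reason and the
trust_other_classifier flag are all made up for illustration, and
deciding that a digest collides with an unrelated message is assumed
to happen elsewhere.  The 'My mail' case is left out, since that
happens before sending rather than on arrival.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # All fields are hypothetical inputs, not part of any Pyzor API.
    is_ham: bool            # verdict from the user or another classifier
    recipient_count: int    # how many recipients the message was sent to
    report_count: int       # hit count returned by a Pyzor check
    digest_collision: bool  # digest matches an unrelated message


def whitelist_reason(msg, trust_other_classifier=False):
    """Return the category that justifies whitelisting, or None."""
    if not msg.is_ham:
        return None  # never whitelist spam
    if msg.digest_collision:
        # 'Bugs': worth doing even for a single-recipient message.
        return "bug"
    if msg.recipient_count < 2:
        # Nobody else can benefit from the whitelist entry.
        return None
    if msg.report_count > 0:
        # 'False positives': someone reported this ham as spam.
        return "false positive"
    if trust_other_classifier:
        # 'True negatives': whitelist pre-emptively, at extra server cost.
        return "true negative"
    return None
```

For example, whitelist_reason(Message(True, 5, 3, False)) gives
"false positive", while the same message with a single recipient gives
None.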

Something that you could look at (if you get suitable corpora of
email) is the time distribution of identically-hashed messages
(ignoring any caused by bugs).  If identical-hash messages generally
arrive within a short period of time (e.g. several hours), then the value
of manual whitelisting based on user feedback is diminished (because
it is likely that by the time it is done all the messages have been
classified already).  In that case, only automated whitelisting would
have much value.  However, if they arrive over a long period of time
(e.g. batches occurring over a week), then there is value in a manual
whitelist operation.
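
A minimal sketch of how that measurement might be done, assuming you
can reduce your corpus to (digest, arrival time) pairs with the
bug-caused collisions already filtered out.  The function names and
the input format are my own invention here, not anything Pyzor
provides.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def arrival_spans(observations):
    """Map each digest seen more than once to (last - first) arrival time.

    `observations` is an iterable of (digest, datetime) pairs.
    """
    arrivals = defaultdict(list)
    for digest, when in observations:
        arrivals[digest].append(when)
    return {
        digest: max(times) - min(times)
        for digest, times in arrivals.items()
        if len(times) > 1
    }


def share_within(spans, window):
    """Fraction of repeated digests whose arrivals all fall inside `window`."""
    if not spans:
        return 0.0
    return sum(1 for span in spans.values() if span <= window) / len(spans)
```

If share_within(spans, timedelta(hours=6)) comes out close to 1.0 for
a representative corpus, that would point towards only automated
whitelisting being worthwhile; a low value would suggest manual
whitelisting can still arrive in time to help.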

Cheers,
Tony