Re: Whitelist count = 10000?


> 1) how can the server get a wl count of 10000 at the first place for
> an obvious piece of spam? (I have a handle of similar emails that
> falls into the same case);

This digest (778941d994b5281bf5652cd293a2761421cc109d) is a special
case.  Ticket #1037314 deals with this case.  Basically, the only
content that Pyzor finds to use for digesting for some types of
message is:

# pyzor predigest < ~/message.eml
<!DOCTYPEHTMLPUBLICHTML4.0

Obviously, this isn't text that would be unique to a message.  Until
that ticket is resolved, both ham and spam can end up with the same
digest.  With a classifier like Pyzor (where the digests are meant to
be unique), it is many times worse to get a false positive than a
false negative.  For that reason, I manually set the whitelist count
to 10,000 for this one digest, so that until the ticket is resolved,
messages of this type will never be classified as spam.  That means
that there will be a few spam that are missed, but no ham will be
incorrectly classified as spam, which is vastly more important.
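
To illustrate why this collides, here is a minimal sketch.  It assumes
the digest is simply a SHA-1 over the predigested text; the real Pyzor
digesting may involve more steps, and the function and variable names
here are made up for the example.

```python
import hashlib


def toy_digest(predigest: str) -> str:
    """Hypothetical stand-in for the digest step: SHA-1 of the predigest."""
    return hashlib.sha1(predigest.encode("utf-8")).hexdigest()


# Two unrelated messages (one ham, one spam) whose bodies both reduce
# to nothing more than the DOCTYPE fragment after predigesting.
ham_predigest = "<!DOCTYPEHTMLPUBLICHTML4.0"
spam_predigest = "<!DOCTYPEHTMLPUBLICHTML4.0"

# Identical predigest text means identical digests, so the server
# cannot tell the two messages apart.
same = toy_digest(ham_predigest) == toy_digest(spam_predigest)
print(same)
```

Any report against one of these messages therefore counts against
every message that predigests to the same fragment.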

> 2) the client seems to override the end result with even a whitelist
> count of 1, judging from the source code.

That's correct - this was also the case in 0.4 - I believe it has been
true ever since Frank originally added the whitelist functionality.

That's a decade before my time, but my guess would be that he felt
that the whitelist functionality was necessary, but wanted to ensure
that existing tools (perhaps the SpamAssassin plugin) continued to
work.  For example, the current SpamAssassin plugin (which could well
be the same code as when 0.4 was released) ignores the whitelist count
completely.  That means that unless the hit count is adjusted, the
whitelisting would have no effect.  Since authentication is required
for the whitelist command, and a false positive is vastly worse
(especially with a hash-based classifier) than a false negative, it
seems a reasonable choice.
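
As a rough sketch of the client-side behaviour described above (this
is not the actual client source, and the function names are
hypothetical), the override amounts to something like:

```python
def effective_hit_count(hit_count: int, whitelist_count: int) -> int:
    """Sketch: any whitelist count at all overrules the hit count."""
    if whitelist_count > 0:
        return 0
    return hit_count


def looks_like_spam(hit_count: int, whitelist_count: int) -> bool:
    """A tool that only reads the hit count sees the adjusted value."""
    return effective_hit_count(hit_count, whitelist_count) > 0


print(looks_like_spam(500, 0))  # many reports, never whitelisted
print(looks_like_spam(500, 1))  # a single whitelist entry wins
```

A tool that ignores the whitelist count still gets the whitelisting
effect this way, because it only ever sees the adjusted hit count.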

Looking forward, my feeling (as outlined on the list previously) is
that adding a new command ("score"), which combines the hit and
whitelist counts to produce a score between 0 and 1, would be a useful
addition.  This would allow a more refined use of the two counts.  I
don't think it's right to adjust the current behaviour of the "check"
command, since it has behaved that way for so long.  If users wish to
make use of the individual counts, they can either do a check command
without using the standard pyzor client (since it is the client that
overrules the hit count, not the server), or use the info command and
parse the result accordingly.
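
To give an idea of what a combined score could look like, here is one
possible sketch.  The formula (hits as a fraction of all reports) is
only an assumption for illustration; nothing has been decided about
how a "score" command would actually weigh the two counts.

```python
def score(hit_count: int, whitelist_count: int) -> float:
    """Hypothetical 0-1 score: 0.0 is ham-like, 1.0 is spam-like."""
    total = hit_count + whitelist_count
    if total == 0:
        # No reports either way, so there is no evidence of spam.
        return 0.0
    return hit_count / total


# With the manually set whitelist count of 10,000, even a digest with
# a fair number of spam reports stays very close to 0.
print(score(50, 10000))
```

Unlike the current check behaviour, a single whitelist entry would
then reduce the score rather than override the hit count outright.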

Cheers,
Tony