From: Tony M. <to...@sp...> - 2009-06-02 21:37:18
> 1) how can the server get a wl count of 10000 at the first place for
> an obvious piece of spam? (I have a handle of similar emails that
> falls into the same case);

This digest (778941d994b5281bf5652cd293a2761421cc109d) is a special case. Ticket #1037314 deals with this case. Basically, the only content that Pyzor finds to use for digesting for some types of message is:

# pyzor predigest < ~/message.eml
<!DOCTYPEHTMLPUBLICHTML4.0

Obviously, this isn't text that would be unique to a message. Until that ticket is resolved, both ham and spam can end up with the same digest. With a classifier like Pyzor (where the digests are meant to be unique), it is many times worse to get a false positive than a false negative. For that reason, I manually set the whitelist count to 10,000 for this one digest, so that until the ticket is resolved, messages of this type will never be classified as spam. That means that there will be a few spam that are missed, but no ham will be incorrectly classified as spam, which is vastly more important.

> 2) the client seems to override the end result with even a whitelist
> count of 1, judging from the source code.

That's correct - this was also the case in 0.4 - I believe it has been true ever since Frank originally added the whitelist functionality. That's a decade before my time, but my guess would be that he felt that the whitelist functionality was necessary, but wanted to ensure that existing tools (perhaps the SpamAssassin plugin) continued to work. For example, the current SpamAssassin plug-in (which could well be the same code as when 0.4 was released) ignores the whitelist count completely. That means that unless the hit count is adjusted, the whitelisting would have no effect. Since authentication is required for the whitelist command, and a false positive is vastly worse (especially with a hash-based classifier) than a false negative, it seems a reasonable choice.
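
To make the client-side behaviour described above concrete, here is a minimal sketch in Python. It is not the actual Pyzor client code: the function names are hypothetical, and modelling the override as "treat the hit count as zero" is an assumption about how the overruling is expressed.

```python
def effective_hit_count(hit_count: int, wl_count: int) -> int:
    """Hypothetical helper: the hit count a client would act on."""
    # Assumption: any non-zero whitelist count overrules the hit count,
    # modelled here by treating the digest as having no hits at all.
    if wl_count > 0:
        return 0
    return hit_count


def looks_like_spam(hit_count: int, wl_count: int) -> bool:
    """Hypothetical helper: True only with hits and no whitelisting."""
    return effective_hit_count(hit_count, wl_count) > 0
```

Under this rule a tool that only looks at the (adjusted) hit count, and ignores the whitelist count entirely, still never flags a whitelisted digest.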
Looking forward, my feeling (as outlined on the list previously) is that adding a new command ("score"), which would combine the hit and whitelist counts to produce a 0-1 score, would be a useful addition. This would allow a more refined use of the two counts. I don't think it's right to adjust the current behaviour of the "check" command, since it has behaved that way for so long.

If users wish to make use of the individual counts, they can either do a check command without using the standard pyzor client (since it is the client that overrules the hit count, not the server), or use the info command and parse the result accordingly.

Cheers,
Tony
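
As a rough illustration of the "score" idea mentioned above, the two counts could be combined along these lines. This is only a sketch: no such command exists yet, the function name is hypothetical, and the formula (the hit count as a fraction of all reports) is an assumption chosen for simplicity, not a proposal for the final behaviour.

```python
def score(hit_count: int, wl_count: int) -> float:
    """Hypothetical 0-1 score: 0.0 means ham-like, 1.0 means spam-like."""
    # Treat negative inputs as zero so the result stays within [0, 1].
    hits = max(hit_count, 0)
    whitelists = max(wl_count, 0)
    total = hits + whitelists
    if total == 0:
        # Assumption: a digest nobody has reported is not treated as spam.
        return 0.0
    return hits / total
```

With a formula like this, a digest with 500 hits and a whitelist count of 10,000 scores about 0.05, while a digest with hits and no whitelisting scores 1.0.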