From: Oli S. <py...@li...> - 2014-11-25 11:00:29

Hi,

I noticed that pyzor recently received a feature to locally whitelist digests - thanks a lot for this! This got me thinking about how the root cause of these 'digest collisions' could be fixed (see https://github.com/SpamExperts/pyzor/issues/3). I came up with a few ideas, but maybe I'm completely wrong about this, so I'm hoping we can discuss it on the mailing list before I submit any actual code.

#1 : Don't feed "invalid" data to the normalizer

The normalizer assumes it is working with text (it strips URLs, emails, unique identifiers, ...), but pyzor also feeds it *undecoded* non-text attachments (https://github.com/SpamExperts/pyzor/blob/master/pyzor/digest.py#L173-L175). As a result, a message with little or no body text and a base64-encoded attachment ends up with a single line of base64 padding data being used to create the digest.

Ideas:
- ignore non-text parts completely (probably requires code to detect text parts which don't actually have a text/* content-type header)
- use a hash of all non-text parts to update the digest, and don't feed this hash to the normalizer

#2 : If a message body contains only links, short text lines, or very long words (identifiers), the normalizer strips away everything and creates a digest of the empty string, which causes digest collisions.

Let's say we have a body like this:

"buy: http://shop1.example.com"

In the current implementation, pyzor strips away both the URL and the text (because the text is shorter than 8 characters), and the message gets the same digest as one with an empty body.

Idea: Don't *remove* dynamic parts like emails/URLs/identifiers, but replace them with static markers like '[EMAIL ADDRESS]', '[URLPATTERN]', '[LONG STRING]' in order to keep the text structure intact. So, in the example above, we'd actually hash the string "buy:[URLPATTERN]".

#3 : Add more uniqueness to empty body messages

Even with #2 implemented, there are situations where the digester simply doesn't get any data. For example, we see a lot of people sending themselves reminder mails with no body ("Subject: 'buy milk'").

Idea: if there is no usable body data, use the Subject header to create the digest.

What do you think?

Best regards
Oli

--
message transmitted on 100% recycled electrons
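
For illustration, the three ideas above could be sketched roughly as follows. This is a minimal sketch and not pyzor's actual code: the names `DYNAMIC_RE`, `MARKERS`, `normalize` and `sketch_digest` are hypothetical, the regular expressions and the 10-character "long string" threshold are assumptions, and SHA-1 is used only as an example hash.

```python
import email
import hashlib
import re

# Hypothetical patterns; a real normalizer would use its own expressions.
DYNAMIC_RE = re.compile(
    r"(?P<url>\b(?:https?|ftp)://\S+)"
    r"|(?P<email>\S+@\S+\.\S+)"
    r"|(?P<long>\S{10,})",
    re.IGNORECASE,
)
MARKERS = {
    "url": "[URLPATTERN]",
    "email": "[EMAIL ADDRESS]",
    "long": "[LONG STRING]",
}


def normalize(text):
    """Idea #2: replace dynamic parts with static markers, drop whitespace."""
    return "".join(
        DYNAMIC_RE.sub(lambda m: MARKERS[m.lastgroup], token)
        for token in text.split()
    )


def sketch_digest(msg):
    """Return a hex digest for an email.message.Message (illustrative only)."""
    digest = hashlib.sha1()
    has_data = False
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        if part.get_content_maintype() == "text":
            charset = part.get_content_charset() or "utf-8"
            try:
                text = payload.decode(charset, "ignore")
            except LookupError:  # unknown charset name
                text = payload.decode("ascii", "ignore")
            data = normalize(text).encode("utf-8")
        else:
            # Idea #1: hash the decoded attachment bytes directly and
            # bypass the text normalizer.
            data = hashlib.sha1(payload).digest() if payload else b""
        if data:
            digest.update(data)
            has_data = True
    if not has_data:
        # Idea #3: no usable body data, so fall back to the Subject header.
        subject = str(msg.get("Subject", ""))
        digest.update(normalize(subject).encode("utf-8"))
    return digest.hexdigest()
```

With this sketch, `normalize("buy: http://shop1.example.com")` returns `"buy:[URLPATTERN]"`, and two messages with empty bodies but different subjects get different digests. Whether short lines should still be dropped after marker replacement is left open here.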