Hi,
I noticed that pyzor recently received a feature to locally whitelist
digests - thanks a lot for this! This got me thinking of how the root
cause of these 'digest collisions' could be fixed (see
https://github.com/SpamExperts/pyzor/issues/3 ).
I came up with a few ideas - but maybe I'm completely wrong about this,
so I'm hoping we can discuss it on the mailing list before I submit
any actual code.
#1 : Don't feed "invalid" data to the normalizer
The normalizer assumes it is working with text (it strips URLs, email
addresses, unique identifiers, ...), but pyzor also feeds it *undecoded*
non-text attachments. (https://github.com/SpamExperts/pyzor/blob/master/pyzor/digest.py#L173-L175)
As a result, for messages with little or no body text and a
base64-encoded attachment, a single line of base64 padding data ends up
being the only input used for creating the digest.
Ideas:
- ignore non-text parts completely (this probably requires code to
detect text parts which don't actually have a text/* content-type header)
- use a hash of all non-text parts to update the digest and don't feed
this hash to the normalizer
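
A rough sketch of the second idea, not actual pyzor code: text/* parts
are decoded and collected for the normalizer, while all other parts are
decoded and hashed directly. The function name digest_parts and the use
of SHA-1 for the attachment hash are assumptions made for illustration;
only the standard library email and hashlib modules are used.

```python
import hashlib
from email import message_from_string


def digest_parts(msg):
    """Split a parsed message into normalizer input and an attachment hash.

    Returns (text_chunks, attachment_hash). attachment_hash is None when
    the message has no non-text parts. Hypothetical helper, not pyzor API.
    """
    text_chunks = []
    attachment_hash = hashlib.sha1()
    has_attachment = False
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        # decode=True undoes base64/quoted-printable transfer encoding
        payload = part.get_payload(decode=True) or b""
        if part.get_content_maintype() == "text":
            charset = part.get_content_charset() or "utf-8"
            try:
                text_chunks.append(payload.decode(charset, errors="replace"))
            except LookupError:  # unknown charset name in the header
                text_chunks.append(payload.decode("utf-8", errors="replace"))
        else:
            # Non-text data bypasses the normalizer entirely
            attachment_hash.update(payload)
            has_attachment = True
    return text_chunks, (attachment_hash.hexdigest() if has_attachment else None)
```

The caller would then run only text_chunks through the normalizer and
mix the attachment hash into the final digest afterwards. This does not
cover the case from the first idea, where a text part is missing a
text/* content-type header.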
#2 : If a message body contains only links, short text lines, or very
long words (identifiers), the normalizer strips away everything and
creates a digest of the empty string, which causes digest collisions.
Let's say we have a body like this:
"buy: http://shop1.example.com"
In the current implementation, pyzor strips away both the url and the
text (because the text is shorter than 8 characters) and the message
gets the same digest as one with an empty body.
Idea: Don't *remove* dynamic parts like emails/URLs/identifiers, but
replace them with static markers like '[EMAIL ADDRESS]',
'[URLPATTERN]', '[LONG STRING]' in order to keep the text structure
intact. So, in the example above, we'd actually hash the string
"buy:[URLPATTERN]"
#3 : Add more uniqueness to empty body messages
Even with #2 implemented, there are situations where the digester
simply doesn't get any data. For example, we see a lot of people
sending themselves reminder mails with no body ("Subject: 'buy milk'").
Idea: if there is no usable body data, use the Subject header to create
the digest.
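
A possible shape for this fallback, again only a sketch: the function
names and the choice to strip whitespace from the subject are my
assumptions, and normalized_body stands for whatever the normalizer
returned for the message body.

```python
import hashlib
from email import message_from_string


def digest_source(msg, normalized_body):
    """Return the string to hash: the body if usable, else the Subject."""
    if normalized_body:
        return normalized_body
    subject = str(msg.get("Subject", ""))
    # Drop whitespace so header folding does not change the digest
    return "".join(subject.split())


def digest_with_subject_fallback(msg, normalized_body):
    """SHA-1 hex digest of the body, or of the subject for empty bodies."""
    source = digest_source(msg, normalized_body)
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


reminder = message_from_string("Subject: buy milk\n\n")
print(digest_source(reminder, ""))
```

Two reminder mails with different subjects would then get different
digests. A message with neither body nor subject would still hash the
empty string.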
What do you think?
Best regards
Oli
--
message transmitted on 100% recycled electrons