From: Roman S. <rn...@on...> - 2004-04-29 03:33:22
On Tue, 27 Apr 2004, Tom Allison wrote:

>I recently subscribed to this list not because I'm a pyzor user (yet)
>but I'm trying to understand how it works.
>
>I was trying to find out how this email tagging and reporting system
>works, in general, from the razor-agent mailing list, but they have a
>rather closed door. Because of that and the apparent attention paid to
>the commercial product, at the expense of their GPLed product, it leaves
>me nonplussed.
>
>So here I am and here's my question:
>
>From what I can find, in general the reporting of spam consists of
>turning the BODY of the message into an MD5-type hash string and
>reporting that signature.
>
>So I played with it a bit and from my 3780 spams in my archive, I found
>90 of them actually held my name and/or email address (or some part
>thereof). This was the only quick way that I could see if the actual
>BODY had been customized for my delivery (aren't they thoughtful!).
>This works out to ~2.4% of my received spam.
>
>I did not do anything to strip HTML or MIME-decode or uudecode or
>anything like that. A message in any form would still hash to a unique
>ID and I wasn't trying to be that exact.
>
>So, I guess my question is, how do you compensate when reporting spam
>that has a unique tag included in the body, so as to still provide some
>degree of spam identification that is worth sharing?

Your answer can be found in the Pyzor source code. The code is easy to
read, and you will see that before computing the hash Pyzor removes
URLs, spaces, and other things.

But I must admit that a large amount of spam today uses randomizing
techniques which Pyzor does not handle well. So the only solution is
to combine a rule-based spam-catcher (like SpamAssassin) + a Bayes
filter (like spamoracle, etc) + some SMTP-level heuristics to catch
spam.

Probably banning HTML emails would have a devastating effect on spam
too ;-)

Sincerely yours, Roman Suzi
--
rn...@on... =\= My AI powered by GNU/Linux RedHat 7.3
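
To illustrate the idea of removing URLs, spaces, and similar per-recipient material before hashing, here is a minimal sketch in Python. It is not Pyzor's actual algorithm: the regular expressions, the function names `normalize_body` and `body_digest`, and the choice of SHA-1 are all assumptions made for this example. Read the Pyzor source for what it really does.

```python
import hashlib
import re

# Hypothetical patterns chosen for illustration; not taken from Pyzor's source.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\S+@\S+")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_body(body: str) -> str:
    """Drop parts of a body that are cheap for a spammer to vary per recipient."""
    body = URL_RE.sub("", body)         # tracking URLs often carry a unique id
    body = EMAIL_RE.sub("", body)       # the recipient's own address
    body = WHITESPACE_RE.sub("", body)  # spacing and line-wrapping differences
    return body


def body_digest(body: str) -> str:
    """Return a hex digest of the normalized body (SHA-1 is an assumption here)."""
    return hashlib.sha1(normalize_body(body).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "Dear tom@example.com,\nBuy now at http://spam.example/offer?id=123\n"
    b = "Dear  roman@example.org,\n\nBuy now at http://spam.example/offer?id=999\n"
    # Both copies reduce to the same text, so they share one signature.
    print(body_digest(a) == body_digest(b))
```

Two copies of a message that differ only in the recipient address, a tracking URL, or spacing get the same digest under this scheme. A message padded with random words would still hash differently, which is the weakness described above.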