Re: Digesting and Spam-matching

On Thu, May 01, 2003 at 02:46:43PM -0400, Keith Jackson wrote:
> > As I stated in the previous mail, pyzor doesn't hash the entire mail, just
> > a section of it after removing data that would likely contribute to
> > non-uniqueness, such as whitespace, urls, and email addresses.  The exact
> > rules for removing suspicious elements are described in
> > pyzor.client.DataDigester.
> 
> So, browsing the source, studying the rules, I can compose a spam mail that
> will easily defeat this system. So then, not to be rude or putting-down,
> this is mostly useless. People started blocking key words like 'cock'. So
> the spammers use 'c0ck'. People start using this system, the spammers will
> read your source and get around it. So, rather than a real solution to spam,
> this sounds more like one step in the cat and mouse game.

If you're only interested in perfect solutions, you're going to be reading a 
lot of spam over the next few years.

> So, keeping two months, and some efficient storage of hashes, such as not
> duplicating them across spam mail entries, and using compression for
> server/client communication, I don't think it's really too impractical.

I think you're optimizing for a pretty narrow class of client - someone with
a fast, unmetered net connection (so it's not burdensome to have to download
all of the spam messages and upload the hash chains) and a relatively fast
PC (to calculate a few hundred MD5 hashes per message), with access to a pretty
nice server (to store a few hundred MD5 hashes per message per customer). The
server becomes more useful as it's got more customers (someone's gotta read
the spam the first time, and say "hey! that's a spam!", unless you're 
going to depend on spam-trap addresses), so it's not so interesting to say
"well, four of my friends and I will share a little server on an old PC
I've got in my garage".

Now, if we think about other people who care about spam - like, say, people
with dialup access or metered access who don't want to download spams only
to discard them on the client PC - or ISPs who don't want their spool
disks filled with spams waiting to be delivered then discarded - then there's
a pretty significant processing load placed on the receiving mailserver,
if they're the ones that have to calculate the hashes every time a message
is received.

> I'd give a few more bytes for better spam protection.

Have you done the math on this? I don't think we're talking about "a few
more bytes", I think we're talking about a fair amount of data, if you're
planning to store an ordered list of 128-bit values for every message
received over the last 60 days for a few thousand, or tens of thousands,
of people.

Your initial message in this thread, stripped of its headers, and counted
by "wc", had 327 words. Assuming I have a nice way to store the hashes
that doesn't incur any overhead, that's 5232 bytes of data (327 words at
16 bytes per MD5) to remember your message. If I get 300 emails per day
like yours (probably not too far off the mark), and I want to store 60
days' worth of them, that's about 94 megabytes of data .. for one person's
email. Now, I'm probably worse than the average, but that's still a few
orders of magnitude worse than seems practical.
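
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. The constant and function names are mine, made up for illustration;
the inputs are the figures quoted above (327 words, one 16-byte MD5 per word,
300 messages a day, 60 days), and it assumes zero storage overhead.

```python
# Back-of-the-envelope storage estimate for per-word MD5 hash chains.
# All names here are illustrative; the numbers come from the message above.
WORDS_PER_MESSAGE = 327   # word count reported by "wc"
MD5_DIGEST_BYTES = 16     # an MD5 digest is 128 bits
MESSAGES_PER_DAY = 300
RETENTION_DAYS = 60


def chain_bytes(words, digest_bytes=MD5_DIGEST_BYTES):
    """Bytes needed to store one ordered hash chain, with no overhead."""
    return words * digest_bytes


def mailbox_bytes(words, per_day, days):
    """Bytes needed to keep every chain for one mailbox over the window."""
    return chain_bytes(words) * per_day * days


per_message = chain_bytes(WORDS_PER_MESSAGE)
total = mailbox_bytes(WORDS_PER_MESSAGE, MESSAGES_PER_DAY, RETENTION_DAYS)

print(per_message)                # 5232
print(total)                      # 94176000
print(round(total / 2**20, 1))    # 89.8 (binary megabytes)
```

That works out to roughly 94 million bytes, or about 90 binary megabytes,
per mailbox, before multiplying by the number of users.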

(The storage requirements would be reduced quite a bit if you didn't want
to keep the order of the hashes, but that would also reduce the 
ability of the program to differentiate between messages .. or if we
didn't care about privacy, and stored the message itself - the same data
from your message, whose hashes totalled 5232 bytes of data, was only
1993 bytes as text.)

And all of this assumes that you have a good algorithm for deciding
whether two hash chains are similar enough to be the same, or not; 
I think you're making the problem a lot harder by trying to operate on
hashes, rather than text, because it's easier to write code that 
ignores garbage strings than code that ignores hashes of garbage 
strings. 
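
To make that concrete, here is a small sketch in Python. It is not how
pyzor or any real system does it: the garbage test is a made-up heuristic
(tokens mixing letters and digits), the sample messages are invented, and
difflib's SequenceMatcher is just one convenient way to compare two ordered
lists. The point is only that the filter has to run on text; once a junk
token is hashed, its digest looks like any other digest.

```python
import hashlib
from difflib import SequenceMatcher


def chain(text):
    """Ordered list of per-word MD5 digests (an assumed chain format)."""
    return [hashlib.md5(w.encode("utf-8")).digest() for w in text.lower().split()]


def similarity(chain_a, chain_b):
    """Ratio in [0, 1] of matching digests, respecting order."""
    return SequenceMatcher(None, chain_a, chain_b, autojunk=False).ratio()


def looks_like_garbage(word):
    """Hypothetical heuristic: a token mixing letters and digits is junk."""
    return any(c.isdigit() for c in word) and any(c.isalpha() for c in word)


def strip_garbage(text):
    """Drop junk tokens. This only works while we still have the text."""
    return " ".join(w for w in text.split() if not looks_like_garbage(w))


# Two copies of the same spam, each padded with different random tokens.
spam_a = "buy cheap pills now xk3q9z jr7wq2"
spam_b = "buy cheap pills now p0d8fm zz41ty"

raw = similarity(chain(spam_a), chain(spam_b))
cleaned = similarity(chain(strip_garbage(spam_a)), chain(strip_garbage(spam_b)))

print(round(raw, 2))   # 0.67: the junk tokens drag the score down
print(cleaned)         # 1.0: identical once the junk is removed before hashing
```

A server that only ever sees the digests has no equivalent of
strip_garbage() available to it.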

Further, the privacy protection provided by the hash isn't very good -
what's to keep a nosy server operator from running MD5 over the
contents of a few good dictionaries, and then substituting the known
hashes for the contents of the messages you disclose? Sure, they'd
probably end up with some missing words, but messages written in 
known languages would be revealed pretty quickly. (Unless you use
some salt, so that when I hash a word I get a different result than
when you hash that word .. but then it's not possible to compare
my messages received to your messages received, and notice interesting
parallels.)
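
A minimal sketch of that dictionary attack, in Python. Everything here is
illustrative: the tiny word list stands in for "a few good dictionaries",
the message is invented, and the per-word, lowercased MD5 scheme is my
assumption about how such a chain would be built.

```python
import hashlib


def word_hash(word, salt=b""):
    """MD5 of one lowercased word, optionally prefixed with a salt."""
    return hashlib.md5(salt + word.lower().encode("utf-8")).hexdigest()


def hash_chain(text, salt=b""):
    """Ordered per-word hashes, as a client might disclose to a server."""
    return [word_hash(w, salt) for w in text.split()]


def build_lookup(dictionary_words):
    """What a nosy operator precomputes: unsalted hash -> word."""
    return {word_hash(w): w for w in dictionary_words}


def recover(chain, lookup):
    """Substitute known hashes with words; unknown ones become '?'."""
    return " ".join(lookup.get(h, "?") for h in chain)


dictionary = ["buy", "cheap", "pills", "now", "meet", "me", "at", "noon"]
lookup = build_lookup(dictionary)

unsalted = hash_chain("meet me at noon Keith")
print(recover(unsalted, lookup))   # meet me at noon ?

salted = hash_chain("meet me at noon Keith", salt=b"per-user-secret")
print(recover(salted, lookup))     # ? ? ? ? ?

# The cost of salting: two users with different salts can no longer
# compare chains for the same message.
same_message = "buy cheap pills now"
print(hash_chain(same_message, b"alice") == hash_chain(same_message, b"bob"))  # False
```

Only the name missing from the word list survives the unsalted case, and the
salted case protects the text at the price of making cross-user matching
impossible.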

> I just don't think pyzor is going to work for me as it stands now. And if I
> were writing a mass mailer program that they advertise via mass mail ;) I'd
> design it to beat this.
> 
> I do wish you much luck with this project though. Anything open source for
> fighting spam is a Good Thing (tm).

Good luck to you, too ..

--
Greg Broiles
gbr...@pa...