Re: Digesting and Spam-matching

On Fri, May 02, 2003 at 09:03:11AM -0400, Keith Jackson wrote:
> 
> 5232 bytes = 327 hashes * 128 bits
> THAT, you are correct, is impractical, and not optimized or compressed even
> remotely.
> 
> Just as my thinking aloud in these emails, you could store them as
> references to a dictionary. The reference is going to be less than 128 bits.
> An index number, let's say 4 bytes... an unsigned long. The dictionary won't
> grow forever, it will change based on the messages currently on the server.
> So..
> 
> 1308 bytes = 327 hashes * 4 byte refs
> 
> That's 25% storage of what you are talking about. A 75% gain. And I'm not
> even a wiz at this stuff.

If 32 bits of search space is sufficient, and the MD5 step isn't providing
any significant privacy, why bother with it at all? 

Seems like you could reach a similar result just by shipping the software
with a pre-compiled dictionary mapping common words to your 4-byte
references, and passing chains of those around instead.
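
To make that concrete, here is a rough Python sketch of the idea. The word
list, the function names (build_dictionary, encode_message), and the choice
to skip unknown words are all assumptions made up for illustration; nothing
in this thread specifies them.

```python
import struct

# Hypothetical stand-in for a pre-compiled dictionary shipped with the software.
COMMON_WORDS = ["free", "money", "click", "here", "offer", "now"]

def build_dictionary(words):
    """Map each word to a small integer reference (its position in the list)."""
    return {word: index for index, word in enumerate(words)}

def encode_message(text, dictionary):
    """Pack the references for known words as 4-byte unsigned integers.

    Words missing from the dictionary are skipped. That is an assumption of
    this sketch; a real system would need some policy for them.
    """
    refs = [dictionary[w] for w in text.lower().split() if w in dictionary]
    return struct.pack(f"<{len(refs)}I", *refs)

dictionary = build_dictionary(COMMON_WORDS)
chain = encode_message("Click here for FREE money now", dictionary)

# Storage arithmetic from the quoted message: 327 entries per message.
md5_bytes = 327 * 16   # 128-bit MD5 digests, 16 bytes each
ref_bytes = 327 * 4    # 4-byte references
```

Here chain is 20 bytes (five known words at 4 bytes each), and md5_bytes and
ref_bytes come out to the 5232 and 1308 figures quoted above. No MD5 step is
involved anywhere, which is the point of the question.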

> Secondly, see above comment about not storing duplicates. Storing all email
> everyone gets is just silly.

Ok, but aren't you degrading your ability to detect spam, then? I thought we
were designing for the case where nobody sends identical spams (see your
critique of Pyzor's hash strategy) but instead sends individualized spams
with minor differences. So we've either got to store all of them, or hope
that the first spam that we get is similar enough to all of the others
that they still get matched.
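
A small Python sketch of what "similar enough" might mean if only the first
copy is stored. Jaccard similarity over word sets is my own stand-in here,
not a measure anyone in this thread has proposed, and the sample messages
are invented.

```python
def word_set(text):
    """Reduce a message to its set of lowercase words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Size of the overlap divided by size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# The one copy we kept, an individualized variant, and an unrelated message.
stored = word_set("click here for free money now")
variant = word_set("click here for free money today bob")
unrelated = word_set("meeting moved to friday at noon")

variant_score = jaccard(stored, variant)
unrelated_score = jaccard(stored, unrelated)
```

The variant shares 5 of 8 distinct words with the stored copy, the unrelated
message shares none. Whether a score like that counts as a match depends
entirely on a threshold somebody has to pick, and a spammer who varies more
of the text pushes it lower.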

> Again, I'm no expert, but the algorithm being hard shouldn't be a reason for
> not doing it. I'm sure as I'm sitting here there are PhD's all over that
> have papers on the net about good algorithms of this type.

I think this is what I find hard to swallow about your suggestion - that
we just find some smart people from somewhere else to fix the storage 
problem and the spam-matching problem, and then say that the spam problem
has been solved. 

Isn't that like saying "I know how to stop SARS. We'll make some pills 
that everyone will take. That might be expensive, but we'll get some
smart manufacturing guys to help make it cheap. And we don't have an
anti-viral drug that works against SARS yet, but there are a bunch of
smart pharmacology guys who work on that sort of thing all day, and
they'll get that figured out pretty soon. OK, problem solved. Next!"

It doesn't really seem reasonable to me to compare hypothetical software
to actual software - the actual software always has bugs & limitations,
while the hypothetical software never seems to have any, because it can
be modified much more quickly than actual software, and is always
fully debugged & optimized. 

--
Greg Broiles
gbr...@pa...