RE: Digesting and Spam-matching

Keith Jackson, on 2003-05-01, wrote:

> So, browsing the source, studying the rules, I can compose a spam mail
> that will easily defeat this system. So then, not to be rude or
> putting-down, this is mostly useless. People started blocking key words
> like 'cock'. So the spammers use 'c0ck'. People start using this system,
> the spammers will read your source and get around it. So, rather than a
> real solution to spam, this sounds more like one step in the cat and
> mouse game.

You're somewhat right.  I fully understand the limitations of pyzor.  So
far, however, the simple solution has worked well and scaled well.  Once
we need more, it won't be that hard to change pyzor so that it chooses
what to digest more dynamically.  Eventually, however, I fully realize
that we could get to a point where every piece of spam is so different
from the others that they can't be compared to any significant degree.

Pyzor was designed to solve a problem as it existed and could be solved at
the time.  If the problem is borgish, and adapts to preventions such as
pyzor and razor, then these solutions will likely fade away, which is
perfectly fine with me.  Against a skilled, determined, and funded
attacker, preventing spam could become a *much* more difficult problem.

> I disagree. While not as compact and quick, it will catch more spam,
> which is the primary goal. Google has the whole world cached. I'm sure
> storing spam emails is not that impractical. Besides, it doesn't have to
> keep it forever. Spams that were sent out two months ago, are not likely
> to be looked up, how many people don't check their email in two months?

First of all, Google has $$$, bandwidth, and many machines.  Storing spam
emails isn't that impractical, really.  I'm pretty sure the Razor servers
store the entire piece of spam, but when checking, only a digest is sent
as the query.  This nicely allows dynamic rules about hashing, which is a
good idea and requires significantly more work on the part of mass mailer
developers to thwart.
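
As a sketch of why keeping the full message helps (this is my guess at
the general idea, not Razor's actual protocol): if the server keeps the
reported bodies, it can re-hash all of them whenever the rule changes,
while clients still send only a digest.  SpamStore, client_digest, and
the rule functions below are hypothetical names, and the sketch assumes
the client somehow learns the current rule from the server.

```python
import hashlib
from typing import Callable, List, Set

Rule = Callable[[str], str]  # maps a message body to the text that is hashed


def _sha1(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def client_digest(body: str, rule: Rule) -> str:
    """What a client would send: a digest only, never the message itself."""
    return _sha1(rule(body))


class SpamStore:
    """Toy server that keeps full reported messages plus their digests."""

    def __init__(self, rule: Rule) -> None:
        self._rule = rule
        self._messages: List[str] = []
        self._digests: Set[str] = set()

    def report(self, body: str) -> None:
        # Keep the whole body so it can be re-hashed under a later rule.
        self._messages.append(body)
        self._digests.add(_sha1(self._rule(body)))

    def set_rule(self, rule: Rule) -> None:
        # Changing the rule re-digests every stored message.
        self._rule = rule
        self._digests = {_sha1(rule(m)) for m in self._messages}

    def check(self, digest: str) -> bool:
        return digest in self._digests
```

A store that kept only digests could not do the set_rule() step, since
the old digests can't be converted to the new rule.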

> I just don't think pyzor is going to work for me as it stands now. And
> if I were writing a mass mailer program that they advertise via mass
> mail ;) I'd design it to beat this.

If you or I were so inclined, we could probably both design mass
mailers that would defeat any preventive measures, especially if said
measures had source available.  The only tricky one to defeat might be
a Bayesian-like system, which might be the only long-term winner.

-- 
Frank Tobin			http://www.neverending.org/~ftobin/