From: Greg B. <gbr...@bi...> - 2003-05-01 20:39:46
On Thu, May 01, 2003 at 02:46:43PM -0400, Keith Jackson wrote:
> > As I stated in the previous mail, pyzor doesn't hash the entire mail, just
> > a section of it after removing data that would likely contribute to
> > non-uniqueness, such as whitespace, urls, and email addresses. The exact
> > rules for removing suspicious elements are described in
> > pyzor.client.DataDigester.
>
> So, browsing the source, studying the rules, I can compose a spam mail that
> will easily defeat this system. So then, not to be rude or putting-down,
> this is mostly useless. People started blocking key words like 'cock'. So
> the spammers use 'c0ck'. People start using this system, the spammers will
> read your source and get around it. So, rather than a real solution to spam,
> this sounds more like one step in the cat and mouse game.

If you're only interested in perfect solutions, you're going to be reading a lot of spam over the next few years.

> So, keeping two months, and some efficient storage of hashes, such as not
> duplicating them across spam mail entries, and using compression for
> server/client communication, I don't think it's really too impractical.

I think you're optimizing for a pretty narrow class of client - someone with a fast, unmetered net connection (so it's not burdensome to have to download all of the spam messages and upload the hash chains) and a relatively fast PC (to calculate a few hundred MD5's per message), with access to a pretty nice server (to store a few hundred MD5's per message per customer).

The server becomes more useful as it's got more customers (someone's gotta read the spam the first time, and say "hey! that's a spam!", unless you're going to depend on spam-trap addresses), so it's not so interesting to say "well, four of my friends and I will share a little server on an old PC I've got in my garage".

Now, if we think about other people who care about spam - like, say, people with dialup access or metered access who don't want to download spams only to discard them on the client PC - or ISPs who don't want their spool disks filled with spams waiting to be delivered then discarded - then there's a pretty significant processing load placed on the receiving mailserver, if they're the ones that have to calculate the hashes every time a message is received.

> I'd give a few more bytes for better spam protection.

Have you done the math on this? I don't think we're talking about "a few more bytes", I think we're talking about a fair amount of data, if you're planning to store an ordered list of 128-bit values for every message received over the last 60 days for a few thousand, or tens of thousands, of people.

Your initial message in this thread, stripped of its headers, and counted by "wc", had 327 words. Assuming I have a nice way to store the hashes that doesn't incur any overhead, that's 5232 bytes of data to remember your message. If I get 300 emails per day like yours (probably not too far off the mark), and I want to store 60 days' worth of them, that's almost 92 megs of data .. for one person's email. Now, I'm probably worse than the average, but that's still a few orders of magnitude worse than seems practical.

(The storage requirements would be reduced quite a bit if you didn't want to keep the order of the hashes, but that would also reduce the ability of the program to differentiate between messages .. or if we didn't care about privacy, and stored the message itself - the same data from your message, whose hashes totalled 5232 bytes of data, was only 1993 bytes as text.)
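
For anyone who wants to check the numbers above, here is a rough sketch in Python. It assumes the proposed scheme stores one raw 16-byte MD5 digest per whitespace-separated word, in order, which is what the 327-word / 5232-byte figure implies; the function names (hash_chain, chain_bytes, mailbox_bytes) are made up for illustration and are not part of pyzor.

```python
import hashlib

# An MD5 digest is 128 bits, i.e. 16 bytes.
MD5_DIGEST_BYTES = hashlib.md5().digest_size


def hash_chain(text):
    """Return one raw MD5 digest per whitespace-separated word, in order.

    Assumption: this is the per-word "hash chain" being discussed,
    not anything pyzor itself does.
    """
    return [hashlib.md5(word.encode("utf-8")).digest() for word in text.split()]


def chain_bytes(word_count):
    """Bytes needed to store an ordered chain with no storage overhead."""
    return word_count * MD5_DIGEST_BYTES


def mailbox_bytes(word_count, messages_per_day, days):
    """Bytes needed for one person's mail at a fixed message size."""
    return chain_bytes(word_count) * messages_per_day * days


if __name__ == "__main__":
    total = mailbox_bytes(327, 300, 60)
    print(chain_bytes(327))          # 5232 bytes for the 327-word message
    print(total)                     # 94176000 bytes for 60 days at 300/day
    print(round(total / 2**20, 1))   # 89.8 (MiB)
```

That total works out to roughly 90 to 94 megabytes depending on whether a "meg" means 2^20 or 10^6 bytes, which is in line with the "almost 92 megs" estimate.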

And all of this assumes that you have a good algorithm for deciding whether two hash chains are similar enough to be the same, or not; I think you're making the problem a lot harder by trying to operate on hashes, rather than text, because it's easier to write code that ignores garbage strings than code that ignores hashes of garbage strings.

Further, the privacy protection provided by the hash isn't very good - what's to keep a nosy server operator from running MD5 over the contents of a few good dictionaries, and then substituting the known hashes for the contents of the messages you disclose? Sure, they'd probably end up with some missing words, but messages written in known languages would be revealed pretty quickly. (Unless you use some salt, so that when I hash a word I get a different result than when you hash that word .. but then it's not possible to compare my messages received to your messages received, and notice interesting parallels.)

> I just don't think pyzor is going to work for me as it stands now. And if I
> were writing a mass mailer program that they advertise via mass mail ;) I'd
> design it to beat this.
>
> I do wish you much luck with this project though. Anything open source for
> fighting spam is a Good Thing (tm).

Good luck to you, too ..

--
Greg Broiles
gbr...@pa...