From: Greg B. <gbr...@bi...> - 2003-05-02 14:28:42
On Fri, May 02, 2003 at 09:03:11AM -0400, Keith Jackson wrote:
> > 5232 bytes = 327 hashes * 128 bits
> THAT, you are correct, is impractical, and not optimized or compressed even
> remotely.
>
> Just as my thinking aloud in these emails, you could store them as
> references to a dictionary. The reference is going to be less than 128 bits.
> An index number, let's say 4 bytes... an unsigned long. The dictionary won't
> grow forever, it will change based on the messages currently on the server.
> So..
>
> 1308 bytes = 327 hashes * 4 byte refs
>
> That's 25% storage of what you are talking about. A 75% gain. And I'm not
> even a wiz at this stuff.

If 32 bits of search space is sufficient, and the MD5 step isn't
providing any significant privacy, why bother with it at all? Seems
like you could reach a similar result just shipping the software with
a pre-compiled dictionary matching common words to your 4-byte
references, and pass chains of those around, instead.

> Secondly, see above comment about not storing duplicates. Storing all email
> everyone gets is just silly.

Ok, but aren't you degrading your ability to detect spam, then? I
thought we were designing for the case where nobody sends identical
spams (see your critique of Pyzor's hash strategy) but instead sends
individualized spams with minor differences. So we've either got to
store all of them, or hope that the first spam that we get is similar
enough to all of the others to avoid detection.

> Again, I'm no expert, but the algorithm being hard shouldn't be a reason for
> not doing it. I'm sure as I'm sitting here there are PhD's all over that
> have papers on the net about good algorithms of this type.

I think this is what I find hard to swallow about your suggestion -
that we just find some smart people from somewhere else to fix the
storage problem and the spam-matching problem, and then say that the
spam problem has been solved.

Isn't that like saying "I know how to stop SARS. We'll make some pills
that everyone will take. That might be expensive, but we'll get some
smart manufacturing guys to help make it cheap. And we don't have an
anti-viral drug that works against SARS yet, but there are a bunch of
smart pharmacology guys who work on that sort of thing all day, they'll
get that figured out pretty soon. OK, problem solved. Next!"

It doesn't really seem reasonable to me to compare hypothetical
software to actual software - the actual software always has bugs &
limitations, while the hypothetical software never seems to have any,
because it can be modified much more quickly than actual software, and
is always fully debugged & optimized.

--
Greg Broiles
gbr...@pa...
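
The storage figures quoted in this message can be checked with a short sketch. It is only an illustration under assumptions that are not in the thread: the token list is made up, the function names (`full_digest_size`, `build_dictionary`, `encode_refs`) are hypothetical, and the dictionary is a plain in-memory mapping rather than anything a real server would share.

```python
import hashlib
import struct


def md5_digest(token):
    """Return the 16-byte (128-bit) MD5 digest of a text token."""
    return hashlib.md5(token.encode("utf-8")).digest()


def full_digest_size(tokens):
    """Bytes needed to store one full 128-bit digest per token."""
    return sum(len(md5_digest(t)) for t in tokens)


def build_dictionary(tokens):
    """Map each distinct digest to a small integer index (hypothetical scheme)."""
    index = {}
    for t in tokens:
        index.setdefault(md5_digest(t), len(index))
    return index


def encode_refs(tokens, index):
    """Pack each token as a 4-byte unsigned reference into the dictionary."""
    ids = [index[md5_digest(t)] for t in tokens]
    # "<I" is a little-endian unsigned int with a fixed size of 4 bytes.
    return struct.pack("<%dI" % len(ids), *ids)


if __name__ == "__main__":
    # 327 made-up tokens, matching the count used in the thread.
    tokens = ["token%d" % i for i in range(327)]
    index = build_dictionary(tokens)
    print(full_digest_size(tokens))         # 5232 bytes = 327 * 16
    print(len(encode_refs(tokens, index)))  # 1308 bytes = 327 * 4
```

The 1308-byte figure counts only the per-message references. The dictionary that maps each index back to a digest (16 bytes per distinct entry here) has to be stored or distributed as well, and the sketch does not address how clients would keep it in sync.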