RE: Digesting and Spam-matching

> If 32 bits of search space is sufficient, and the MD5 step isn't providing
> any significant privacy, why bother with it at all?
> Seems like you could reach a similar result just shipping the software with
> a pre-compiled dictionary matching common words to your 4-byte references,
> and pass chains of those around, instead.

True, but either way this is a small detail: send each actual token, or send
a hash of each token. Hashing mostly gives a warm fuzzy feeling, so by my own
arguments it's just bloat and the actual token should be sent. If I did hash,
I'd use MD5 because it's simple to implement and would make things at least
mildly difficult for an 'evil server'. As I said before, spam and security
are separate issues. If you want security, use PGP or some other package
designed specifically for it.
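
To make the "hash of each token" option concrete, here is a rough sketch of
what the client side might look like. The whitespace tokenizer, the function
name token_digests, and keeping only the first 4 bytes of each MD5 digest (to
line up with the 32-bit references mentioned above) are all assumptions for
illustration, not a settled design.

```python
import hashlib


def token_digests(body: str) -> list[str]:
    """Split a message body on whitespace and hash each token.

    Assumption: each token is reduced to the first 4 bytes (32 bits) of
    its MD5 digest, hex-encoded, so a message becomes a chain of short
    references.
    """
    return [
        hashlib.md5(token.encode("utf-8")).digest()[:4].hex()
        for token in body.split()
    ]


# Two spams that differ only in the recipient's name share every
# reference except the first one.
first = token_digests("Keith, you've won!!!")
second = token_digests("Greg, you've won!!!")
print(first)
print(second)
```

Sending the raw tokens instead would just mean returning body.split()
directly; everything downstream stays the same.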

> > Secondly, see above comment about not storing duplicates.
> > Storing all email everyone gets is just silly.
>
> Ok, but aren't you degrading your ability to detect spam, then? I thought we
> were designing for the case where nobody sends identical spams (see your
> critique of Pyzor's hash strategy) but instead sends individualized spams
> with minor differences. So we've either got to store all of them, or hope
> that the first spam that we get is similar enough to all of the others to
> avoid detection.

My thought was that the 'first' spam contains something like:

Keith, you've won!!!

3 tokens: [Keith,], [you've], [won!!!]

Now you get a second spam, and it has [Greg,] instead of [Keith,]. Our
'theoretical' matching software will still match it against the spam message
stored on the server, and the [Keith,] token will have its 'variable'
attribute incremented by one.

So now the next match will be stronger. The software will see that [Keith,],
[you've], [won!!!] matches better with [George,], [you've], [won!!!] because
it knows that the first token is variable. I don't want to store each unique
spam. I want to store the first unique spam, which voting and whatnot will
then modify into something more like a spam template. That is the point of
doing this on the server and harnessing the power of distributed computing,
vs. a client trying to figure this stuff out.
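
A toy version of that template idea, just to show the shape of it. The class
name SpamTemplate, the score formula (a mismatch at a position earns
var / (var + 1) credit), the 0.6 threshold, and the restriction to messages
with the same number of tokens are all assumptions made up for this sketch.
Real matching software would need to handle inserted and deleted tokens too.

```python
from dataclasses import dataclass, field


@dataclass
class SpamTemplate:
    tokens: list[str]                     # tokens of the first spam seen
    variable: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.variable:
            # One 'variable' counter per token position, starting at zero.
            self.variable = [0] * len(self.tokens)

    def score(self, candidate: list[str]) -> float:
        """Return a similarity in [0, 1] (hypothetical formula)."""
        if not self.tokens or len(candidate) != len(self.tokens):
            return 0.0
        total = 0.0
        for tok, cand, var in zip(self.tokens, candidate, self.variable):
            # An exact match counts fully; a mismatch counts more the
            # more often this position has already been seen to vary.
            total += 1.0 if tok == cand else var / (var + 1)
        return total / len(self.tokens)

    def absorb(self, candidate: list[str], threshold: float = 0.6) -> bool:
        """If the candidate matches, bump the counters where it differs."""
        if self.score(candidate) < threshold:
            return False
        for i, (tok, cand) in enumerate(zip(self.tokens, candidate)):
            if tok != cand:
                self.variable[i] += 1
        return True


template = SpamTemplate(["Keith,", "you've", "won!!!"])
greg = ["Greg,", "you've", "won!!!"]
george = ["George,", "you've", "won!!!"]

before = template.score(greg)     # first token mismatches, no credit yet
template.absorb(greg)             # [Keith,] position is now marked variable
after = template.score(george)    # same kind of mismatch now scores higher
print(before, after, template.variable)
```

The server only ever keeps the one template plus its counters, not a copy of
every variant.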

Another thought that comes to mind while writing this email: to optimize even
further, the client can hash the whole email (ignoring headers) into a single
hash and send it to the server. If that hash matches, we don't even need to
do the more advanced stuff. If it doesn't, we can go into the more detailed
matching.
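
Roughly, that two-stage check could look like the sketch below. The names
body_of, whole_body_digest and lookup are hypothetical, splitting headers
from the body at the first blank line is a simplification, and detailed_match
stands in for whatever the more advanced matching ends up being.

```python
import hashlib
from typing import Callable


def body_of(raw_message: str) -> str:
    """Drop the headers: keep everything after the first blank line.

    Simplification: a message with no blank line yields an empty body.
    """
    text = raw_message.replace("\r\n", "\n")
    _headers, _sep, body = text.partition("\n\n")
    return body


def whole_body_digest(raw_message: str) -> str:
    """One MD5 hash over the entire body, ignoring headers."""
    return hashlib.md5(body_of(raw_message).encode("utf-8")).hexdigest()


def lookup(raw_message: str,
           known_digests: set[str],
           detailed_match: Callable[[str], bool]) -> str:
    # Stage 1: cheap exact check on the single whole-body hash.
    if whole_body_digest(raw_message) in known_digests:
        return "exact"
    # Stage 2: only now pay for the more detailed matching.
    return "similar" if detailed_match(body_of(raw_message)) else "unknown"


spam = "From: a@example.com\nSubject: hi\n\nKeith, you've won!!!\n"
resent = "From: b@example.org\nSubject: hello\n\nKeith, you've won!!!\n"
known = {whole_body_digest(spam)}
print(lookup(resent, known, lambda body: False))
```

Identical bodies with different headers hit the fast path; anything else
falls through to the expensive comparison.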

> > Again, I'm no expert, but the algorithm being hard shouldn't be a reason for
> > not doing it. I'm sure as I'm sitting here there are PhD's all over that
> > have papers on the net about good algorithms of this type.
>
> I think this is what I find hard to swallow about your suggestion - that
> we just find some smart people from somewhere else to fix the storage
> problem and the spam-matching problem, and then say that the spam problem
> has been solved.
>
> Isn't that like saying "I know how to stop SARS. We'll make some pills
> that everyone will take. That might be expensive, but we'll get some
> smart manufacturing guys to help make it cheap. And we don't have an
> anti-viral drug that works against SARS yet, but there are a bunch of
> smart pharmacology guys who work on that sort of thing all day, they'll
> get that figured out pretty soon. OK, problem solved. Next!"

Granted. But your statement, 'And all of this assumes that you have a good
algorithm for deciding whether two hash chains are similar enough to be the
same, or not', makes it sound as though this is a radical new algorithm that
no one has thought of, or that no one has a good implementation of. And it's
not. I could personally think of a few approaches myself.
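
As one off-the-shelf example of what I mean (not necessarily the approach I'd
end up using), Python's standard difflib module can already score how similar
two sequences are. The function name chain_similarity and the idea of
thresholding its result are assumptions for this sketch; the chains here are
raw tokens, but lists of token hashes work the same way.

```python
from difflib import SequenceMatcher


def chain_similarity(chain_a: list[str], chain_b: list[str]) -> float:
    """Similarity of two token (or hash) chains, from 0.0 to 1.0.

    SequenceMatcher.ratio() is 2 * M / T, where M is the number of
    matched elements and T the combined length of both chains.
    """
    return SequenceMatcher(None, chain_a, chain_b, autojunk=False).ratio()


stored = ["Keith,", "you've", "won!!!"]
incoming = ["Greg,", "you've", "won!!!"]
unrelated = ["Meeting", "moved", "to", "Friday"]

print(chain_similarity(stored, incoming))
print(chain_similarity(stored, unrelated))
```

Whether a simple ratio like this holds up against real spam is exactly the
kind of thing that would have to be tested.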

> It doesn't really seem reasonable to me to compare hypothetical software
> to actual software - the actual software always has bugs & limitations,
> while the hypothetical software never seems to have any, because it can
> be modified much more quickly than actual software, and is always
> fully debugged & optimized.

You are very right, and I've never said this is better; I'd be a fool to say
so until I see it run. I've said it could be better, and I'm throwing around
ideas. I could be wrong, and as you said, we won't know until it's fully
debugged & optimized. But the whole point of this is my 'prove me wrong'
mentality: if you can prove to me that this won't work, then I won't have to
waste my time trying to implement it.

Keith