From: Donal K. F. <don...@ma...> - 2010-02-17 11:56:54
On 17/02/2010 01:36, Tom Jackson wrote:
> No shit! The cost is extreme.
I've been probing the cost of using various hash functions in Tcl (comparing
the classic algorithm against FNV and lookup3); the attached script contains
preliminary results for two keysets:
  - all dictionary words
  - numbers from 0 to 500k
The tests basically check the cost of doing 9 hash lookups per key; array
lookups are about as pure a test of this as we can generate in Tcl. (The
cost of the surrounding "machinery" is about a third of the total time.)
The cost of FNV over these (presumably fairly typical) keys is 3-10% over
the classic algorithm; the cost of lookup3 is 10-30% over the classic
algorithm. Other keysets might have other costs.
Note that the only thing that changes between the runs is the contents of
the function TclHashObjKey in tclObj.c, and the change was done by altering
#ifdefs only. Everything else is just Tcl HEAD (plus a compiled-in
lookup3.c; I'm not retyping *that* code!)
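For reference, here is a minimal sketch of the sort of alternatives being
switched between; the function names and structure are illustrative, not the
actual tclObj.c code (and lookup3 is far too long to retype here; whether
the tested FNV was FNV-1 or FNV-1a doesn't change the cost profile):

/* Classic Tcl string hash: result = 9*result + c for each byte. */
static unsigned int
ClassicHash(const char *string)
{
    unsigned int result = 0;
    char c;

    while ((c = *string++) != '\0') {
        result += (result << 3) + (unsigned char) c;
    }
    return result;
}

/* FNV-1a, 32-bit: XOR each byte in, then multiply by the FNV prime. */
static unsigned int
Fnv1aHash(const char *string)
{
    unsigned int result = 2166136261u;      /* FNV offset basis */
    char c;

    while ((c = *string++) != '\0') {
        result ^= (unsigned char) c;
        result *= 16777619u;                /* FNV prime */
    }
    return result;
}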
> The problem here is that in order to get good distribution to all
> bits, you have to use a near cryptographic quality hash. A
> cryptographic hash can be easily truncated to generate near random
> bucket indexes. This is the desired goal, but it is much more easily
> achieved most of the time with a single modulus operation which takes
> into account all bits. Which is easier: one modulus operation or a
> near-cryptographic hash which can be safely truncated?
Mod is quite an expensive operation, at least on this machine. Applying a
mod operation (mod a large prime) to the result of the classic hash before
returning brings the cost of hashing nearly up to that of FNV.
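That variant is, roughly, the following (the prime is an illustrative
choice, not necessarily the one measured):

/* Hypothetical classic-hash-plus-modulus variant, as costed above.
 * The integer division implied by '%' is what eats the classic
 * hash's speed advantage on this machine. */
#define LARGE_PRIME 2147483647u     /* illustrative choice of prime */

static unsigned int
ClassicHashMod(const char *string)
{
    unsigned int result = 0;
    char c;

    while ((c = *string++) != '\0') {
        result += (result << 3) + (unsigned char) c;
    }
    return result % LARGE_PRIME;
}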
> The system only needs to be broken once. All that is required is a set
> of keys (maybe a few thousand) which hash to the same value. After
> that the entire algorithm is broken (assuming the goal is to poison
> the bucket code with collisions). Assuming a 32 bit key-space, it
> should be easy to find at least one set of 10000 strings which share
> the same hash value.
>
> Do I care to find such a set? NO! But this is a triviality for
> cryptographers.
Remember, cryptographers are mathematicians. They have a definition of
"trivial" which doesn't match the one used by software engineers. I can
remember a number of acquaintances stating that, once an existence proof was
done, providing examples was trivial. Can't say that I really bought their
argument... :-)
In terms of hash attacks, they only have real fangs (given that we don't
conflate hashcode equivalence with equality) if it is possible to make
either:
a) a large number of *short* keys with the same hashcode, or
b) a substantial number of keys with the same hashcode as some particular
   key from a set not nominated by the attacker (typically defined by the
   communication protocol).
The aim of the first is to swamp the hash table with totally irrelevant
things; the aim of the second is to retard required lookups. For the
classic algorithm, case (a) is genuinely easy, as the sketch below shows.
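Since the classic algorithm computes h = 9*h + c per byte, the two-byte
blocks "ba" and "aj" contribute identically (9*98+97 == 9*97+106 == 979), so
any string assembled from n such blocks collides with every other one: 2^n
keys, one bucket. A minimal demonstration (my sketch, not from any of the
cited code):

#include <stdio.h>

/* Classic Tcl string hash: result = 9*result + c for each byte. */
static unsigned int
ClassicHash(const char *string)
{
    unsigned int result = 0;
    char c;

    while ((c = *string++) != '\0') {
        result += (result << 3) + (unsigned char) c;
    }
    return result;
}

int
main(void)
{
    static const char *blocks[2] = {"ba", "aj"};
    enum { NBLOCKS = 12 };               /* 2^12 = 4096 colliding keys */
    char key[2*NBLOCKS + 1];
    unsigned int i, j, first = 0;

    for (i = 0; i < (1u << NBLOCKS); i++) {
        for (j = 0; j < NBLOCKS; j++) {
            key[2*j]     = blocks[(i >> j) & 1][0];
            key[2*j + 1] = blocks[(i >> j) & 1][1];
        }
        key[2*NBLOCKS] = '\0';
        if (i == 0) {
            first = ClassicHash(key);
        } else if (ClassicHash(key) != first) {
            printf("mismatch at key %u\n", i);   /* never reached */
            return 1;
        }
    }
    printf("%u keys of length %u, all hashing to %u\n",
            1u << NBLOCKS, 2*NBLOCKS, first);
    return 0;
}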
> I know there is a huge bias against AOLserver, so I digress to Apache.
> How does Apache handle headers?
>
> From http://thomas.eibner.dk/apache/table.html (Apache table API):
>
> "Tables in Apache works almost like hashes in Perl. The main
> difference between the two is that you can have two of the same keys
> in the Apache table API. The table lookups are done in a
> case-insensitive manner."
>
> So for everyone who has some problem with AOLserver, maybe note that
> headers in Apache are handled very similarly. More specifically they
> are not handled as associative arrays or dictionaries.
Checking the source to Apache (strictly, the APR) I see that they're using
hashing inside their implementation of tables. OK, it's a truly trivial
hash, but I suppose it's better than what they had before (a single linear
list).
Check for yourself at:
http://svn.apache.org/repos/asf/apr/apr/trunk/tables/apr_tables.c
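If I remember the code correctly, the "trivial" part is that the hash is
just the first character of the key, lowercased and masked into a small
fixed set of buckets; treat this as an approximation of apr_tables.c rather
than a verbatim copy:

#include <ctype.h>

/* Approximation (from memory) of the bucketing in apr_tables.c:
 * hash on the first character only, case-insensitively, into a
 * small fixed number of buckets. */
#define TABLE_HASH_SIZE 32
#define TABLE_INDEX_MASK (TABLE_HASH_SIZE - 1)

static unsigned int
TableHash(const char *key)
{
    return (unsigned char) tolower((unsigned char) *key) & TABLE_INDEX_MASK;
}

A hash like that gives at most 32 distinct buckets whatever the keys are,
but for the small tables used for HTTP headers that's apparently enough.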
> All I can say is that the more I investigate the original Tcl hash,
> the more my respect grows. It seems to perform better most of the
> time.
It's very very fast, but weak. With non-malicious inputs, its weakness
doesn't really matter all that much and its speed is nice. With malicious
inputs, which are easy to create (see above), it's no better than a linear
list (plus some memory costs). The alternative hash functions recommended
in the literature (FNV, lookup3) are slower, but harder to push into the
ill-performing case: forcing lookup3 into it is believed to require effort
nearly equivalent to scanning through all possible inputs, which isn't a
practical threat, while FNV is faster but weaker.
Donal.