From: Donal K. F. <don...@ma...> - 2010-02-16 16:46:04
On 16/02/2010 15:25, Gustaf Neumann wrote:
> How bad is this in practice? This "attack" is not more than a bad
> performance in a pathological case, or?

It's exactly a performance issue; the problem is that it is far too
easy to trigger this worst-case behaviour. When the behaviour is
triggered, Tcl still functions correctly, though. It never thinks two
keys are the same unless they've been compared using full equality.
(It just happens to think they're different if the hashcodes are
different; it's a one-way implication.)

> Concerning FNV: for string lengths less than two characters, FNV and
> the classic Tcl function return the same hash value (just a side
> note).

Only because I'd made a mistake. :-) The FNV algorithm uses a non-zero
baseline value, which means that the hashcodes are different.

> As Lars pointed out, it is in general possible to find strings
> $x and $y such that FNV($x) = FNV($y). But for the "collision attack"
> hash-equality is not required. The hash function implies a linear
> search (in the buckets), if
>
>     INDEX(Hash($x)) == INDEX(Hash($y))
>
> no matter which hash function is used (where INDEX is the function
> that calculates the bucket from the hash value). Linear search is
> only a problem when the number of entries in the bucket is high.

Precisely. Even 8 values per bucket isn't a big problem. The problem
comes when you go into the realm of thousands. That's when it matters.
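For concreteness, here is a minimal sketch of a 32-bit FNV-1a core
(illustration only; I am not claiming this is byte-for-byte the
variant that was compiled in):

    #include <stddef.h>
    #include <stdint.h>

    /*
     * 32-bit FNV-1a. The non-zero offset basis is the "baseline
     * value" mentioned above: it is why even zero- and one-character
     * strings get different hashcodes than with the classic Tcl
     * function.
     */
    static uint32_t
    Fnv1a32(const char *key, size_t length)
    {
        uint32_t hash = 2166136261u;        /* FNV offset basis */
        size_t i;

        for (i = 0; i < length; i++) {
            hash ^= (uint32_t) (unsigned char) key[i];
            hash *= 16777619u;              /* FNV prime */
        }
        return hash;
    }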
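And the INDEX step itself, in sketch form, assuming a power-of-two
bucket count selected by masking (which is what Tcl's string-key path
does; the name BucketIndex is mine, not Tcl's):

    /*
     * Only the low-order bits of the hash select the bucket, so two
     * keys land in the same chain whenever those bits agree, even if
     * their full 32-bit hashes differ. With 16384 (2^14) buckets,
     * only 14 bits of the hash matter.
     */
    static size_t
    BucketIndex(uint32_t hash, size_t numBuckets)   /* power of two */
    {
        return (size_t) hash & (numBuckets - 1);
    }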
> It is very easy to come up with values, where also with FNV
> the hash lookup becomes a linear search. Consider (with FNV
> compiled in): [...]
> In both cases, all entries are added into just one bucket. I see no
> reason why this can't happen for every other reasonable table size.

If the hashcodes are the same, the problem happens. If the hashcodes
are only partially the same (in some number of low-order bits) then
small hash tables have a problem, but the problem goes away once the
table grows past some size.

> In the case of "collide" with 32768 hash table entries, just 14 bits
> of the hash keys are used to determine the bucket. So, neither FNV
> (nor any other hash function) can guarantee that the buckets are
> filled evenly and that the degenerate case cannot happen.

The point is that it should be "hard" to find a large number of
collisions with any particular value. Preferably there should be as
few collisions as possible, but that's tricky to achieve. Being able
to bury the entries that you *do* want is worse; that's because, in
practice, most keys are not created equal.

> if a high degree of binary compatibility is the only concern:
> it should be possible to keep the existing machinery in 8.* in
> parallel to a more optimized version, but the main question is most
> likely: is it worth the effort?

The only major chunk of effort round here is this email thread. :-)

[The rest of your message I'll get around to a bit later tonight.]

Donal.