From: Gustaf N. <ne...@wu...> - 2010-02-16 15:25:55
Am 15.02.10 00:54, schrieb Donal K. Fellows:
> ...
> than what I really wanted!) with the result, after much experimentation,
> that the FNV hash function is slower than Tcl's original hash function.
> Not that it's a bottleneck for any existing code.
Whatever "bottleneck" means here: since FNV is slower than the old hash
function, and assuming the "collision attack" is not a problem for
most programs, most programs will become slower (admittedly, by a small
degree).
Concerning the "attack": the problem is apparently that "aj" and "ba"
result in the same hash key under the classic Tcl string hash
function. The same holds for any other concatenation of "aj" and "ba",
as long as the resulting strings have the same length, e.g.
Hash(baba) == Hash(ajaj) == Hash(baaj) == Hash(ajba).
Since the hash is the same, the index of the bucket (based on its low
bits) is the same, therefore the same bucket is used for all these
keys. For these hash keys, lookup becomes a linear search. Since the
hash values are equal, the hash-value comparison cannot avoid the
string comparison.
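The collision family above is easy to check with a few lines. The sketch below is a Python model of the classic hash recurrence result += (result << 3) + c, i.e. h = h*9 + c (a model of the C code for illustration, not the Tcl-level API):

```python
from itertools import product

def classic_hash(s):
    """Model of Tcl's classic string hash: h = h*9 + byte, in 32-bit arithmetic."""
    h = 0
    for c in s.encode():
        h = (h * 9 + c) & 0xFFFFFFFF
    return h

# "aj" and "ba" collide: 97*9 + 106 == 98*9 + 97 == 979.
# Both pairs map any prefix hash h to 81*h + 979, so every
# concatenation of them collides as well:
assert classic_hash("aj") == classic_hash("ba") == 979
for n in (2, 15):  # string lengths 4 and 30 (the "collide" case)
    hashes = {classic_hash("".join(p)) for p in product(("aj", "ba"), repeat=n)}
    print(n, len(hashes))  # all 2**n strings share a single hash value
```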
With these two two-character pairs, for a string length of 4 we have
just 4 colliding keys (with longer strings, exponentially more, as
"collide" with string length 30 shows). The situation is the same for
many more two-character pairs.
How bad is this in practice? This "attack" amounts to no more than
bad performance in a pathological case, or does it?
Concerning FNV: as a side note, for string lengths of less than two
characters, FNV and the classic Tcl function return the same hash
value. As Lars pointed out, it is in general possible to find strings
$x and $y such that FNV($x) == FNV($y). But for the "collision attack",
hash equality is not even required. The hash function implies a linear
search (in the buckets) whenever
INDEX(Hash($x)) == INDEX(Hash($y))
no matter which hash function is used (where INDEX is the function
that computes the bucket from the hash value). Linear search is only a
problem when the number of entries in the bucket is high.
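To make the INDEX argument concrete, here is a sketch using the standard 32-bit FNV-1a constants (an assumption for illustration; Tcl's exact FNV variant may differ in detail) and a power-of-two mask as INDEX; the masking is what discards all but the low bits of the hash:

```python
FNV_OFFSET, FNV_PRIME = 2166136261, 16777619  # standard 32-bit FNV constants

def fnv1a(s):
    """Standard 32-bit FNV-1a (assumption: Tcl's variant may differ in detail)."""
    h = FNV_OFFSET
    for c in s.encode():
        h = ((h ^ c) * FNV_PRIME) & 0xFFFFFFFF
    return h

def index(h, num_buckets):
    """Bucket selection from the low bits of the hash (num_buckets a power of two)."""
    return h & (num_buckets - 1)

# Hash inequality does not prevent a bucket collision: any two hash
# values that agree in the low bits land in the same bucket.
assert fnv1a("a") != fnv1a("b")
assert index(0x12340005, 16) == index(0xABCD0005, 16) == 5
```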
It is very easy to come up with values where even with FNV
the hash lookup becomes a linear search. Consider (with FNV
compiled in):
foreach key {a e i m q u y 1 5 tcl_platform tcl_library} {
  set x($key) 1
}
puts [array statistics x]
% 11 entries in table, 4 buckets
% number of buckets with 0 entries: 3
% number of buckets with 10 or more entries: 1
% average search distance for entry: 6.0
or for example:
foreach key {
  0z 19 1i 1y 2l 33 3c 3s 46 4f 4v 55 5e 5u 68 6h 6x 7o 82 8b 8r 91 9a 9q
  a9 ai ay bl c3 cc cs d6 df dv e5 ee eu f8 fh fx go h2 hb hr i1 ia iq
} {
  set y($key) 1
}
puts [array statistics y]
% 47 entries in table, 16 buckets
% number of buckets with 0 entries: 15
% number of buckets with 10 or more entries: 1
% average search distance for entry: 24.0
In both cases, all entries are added to just one bucket. I see no
reason why this couldn't happen for any other reasonable table size.
In the case of "collide" with 32768 hash table entries, just 14 bits
of the hash keys are used to determine the bucket. So neither FNV
(nor any other hash function) can guarantee that the buckets are
filled evenly, and the degenerate case cannot be ruled out.
>> It looks to me as if the original Tcl hash function has quite
>> good properties for the task it is used.
> So long as people are not using it in a setting where an attacker
> controls the keys, that's true. If an attacker can dictate what values
> are being hashed (true in a number of cases like HTTP header parsing)
> then you've got a significant performance problem.
Well, in my understanding "attacker" means "performance attacker"
here. I guess the argument in the HTTP case is that one transmits
30,000 malicious HTTP header entries to slow down the array
access from 0.3 microseconds to 518 microseconds
(measured on my machine with the "collide" data). I am pretty sure that
transmitting 30,000 header fields costs much more than that.
>> In order to make hashing faster in Tcl, it is most likely more
>> promising to look at the Tcl-hash machinery altogether.
>
> That's a swamp I've preferred to stay out of, because there are many
> constraints due to the requirement to maintain high degrees of binary
> compatibility. Reworking the hash code in depth is probably an activity
> for 9.0 and not before.
If a high degree of binary compatibility is the only concern:
it should be possible to keep the existing machinery in 8.* in parallel
with a more optimized version. But the main question is most likely:
is it worth the effort?
To my current understanding, the main issue is the whole machinery,
not only the hash function. The issues are
(a) uneven distribution of entries over the buckets,
(b) linear search in the buckets, and
(c) some more or less constant overhead, which could be improved.
One approach is to improve the bucket-growing policy (e.g. linear
hashing, extendible hashing). This would address (a) and (b), but
require some more work.
Another, conceptually simpler approach is to replace
the linear search in the buckets by e.g. AVL trees (these
trees keep themselves balanced). This would address (b), and
compatibility would not be an issue.
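To illustrate what addressing (b) buys, the sketch below contrasts the linear chain search with a binary search on a sorted bucket, used here as a stand-in for a self-balancing tree such as AVL (names and setup are illustrative, not Tcl code):

```python
import bisect

# A degenerate bucket: all 32768 keys hashed into one chain
# (zero-padded, so the list is already in sorted order).
chain = [f"key{i:05d}" for i in range(32768)]

def chain_lookup(bucket, key):
    """Linear search, as in the current bucket chains: O(n) comparisons."""
    for k in bucket:
        if k == key:
            return True
    return False

def tree_lookup(sorted_bucket, key):
    """Binary search on a sorted bucket, standing in for an AVL tree: O(log n)."""
    i = bisect.bisect_left(sorted_bucket, key)
    return i < len(sorted_bucket) and sorted_bucket[i] == key

assert chain_lookup(chain, "key32767")  # worst case: 32768 comparisons
assert tree_lookup(chain, "key32767")   # about 15 comparisons
```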
Changing the hash function just addresses (a).
> The other alternative is Bob Jenkins's lookup3, but that's big, scary,
> and needs to know the length of the data being hashed (which either
> means a strlen() every time for strings, or forces us to break our
> binary-compatibility guarantees).
Well, HashVarKey in TclVar has the key length, so no
strlen() is required there. This is exactly the place
where "collide" stresses the hash function.
When I try different hash functions on "collide",
FNV does not yield the best results. I get better
figures with lookup8 (tested on 64-bit) and
murmur (references below). Note as well
that the average search distance decreases from
3.0 with FNV to 2.0 with lookup8 and murmur.
The number of buckets without entries is also
much lower with lookup3/8 and murmur.
-gustaf neumann
references:
http://burtleburtle.net/bob/hash/index.html
http://murmurhash.googlepages.com/
PS: I suspect that the average search length reported by
"array statistics" is not correct, since it always reports
values ending in ".0"; I have added values in brackets
which I believe are more accurate.
##### FNV
57581 microseconds per iteration
32776 entries in table, 16384 buckets
number of buckets with 0 entries: 8331
number of buckets with 1 entries: 632
number of buckets with 2 entries: 1188
number of buckets with 3 entries: 1570
number of buckets with 4 entries: 1602
number of buckets with 5 entries: 1330
number of buckets with 6 entries: 826
number of buckets with 7 entries: 470
number of buckets with 8 entries: 252
number of buckets with 9 entries: 123
number of buckets with 10 or more entries: 60
average search distance for entry: 3.0 (2.0350)
##### murmur
54371 microseconds per iteration
32776 entries in table, 16384 buckets
number of buckets with 0 entries: 2218
number of buckets with 1 entries: 4401
number of buckets with 2 entries: 4457
number of buckets with 3 entries: 2995
number of buckets with 4 entries: 1484
number of buckets with 5 entries: 549
number of buckets with 6 entries: 196
number of buckets with 7 entries: 59
number of buckets with 8 entries: 19
number of buckets with 9 entries: 6
number of buckets with 10 or more entries: 0
average search distance for entry: 2.00 (1.1569)
##### lookup3
53402 microseconds per iteration
32776 entries in table, 16384 buckets
number of buckets with 0 entries: 2237
number of buckets with 1 entries: 4367
number of buckets with 2 entries: 4534
number of buckets with 3 entries: 2855
number of buckets with 4 entries: 1521
number of buckets with 5 entries: 621
number of buckets with 6 entries: 176
number of buckets with 7 entries: 54
number of buckets with 8 entries: 18
number of buckets with 9 entries: 1
number of buckets with 10 or more entries: 0
average search distance for entry: 2.00 (1.1584)
##### lookup8
55521 microseconds per iteration
32776 entries in table, 16384 buckets
number of buckets with 0 entries: 2209
number of buckets with 1 entries: 4413
number of buckets with 2 entries: 4432
number of buckets with 3 entries: 3052
number of buckets with 4 entries: 1413
number of buckets with 5 entries: 595
number of buckets with 6 entries: 200
number of buckets with 7 entries: 49
number of buckets with 8 entries: 17
number of buckets with 9 entries: 3
number of buckets with 10 or more entries: 1
average search distance for entry: 2.0 (1.1561)