From: Tom J. <tom...@gm...> - 2010-02-17 18:41:53
On Wed, Feb 17, 2010 at 3:56 AM, Donal K. Fellows <don...@ma...> wrote:
>
> Mod is quite an expensive operation, at least on this machine. Applying
> a mod operation (mod a large prime) to the result of the classic hash
> before returning nearly takes the cost of hashing up to equivalence with
> FNV.

My idea was to apply the mod just once (to the final hash, in the bucket
selection code). On further testing, however, the current Tcl hash
produces values which (barring a collision) already give a good bucket
distribution. You can't apply the mod inside the hash function because
the hash function has no access to the size of the hash table (the number
of buckets). Also, the modulus is not "large"; it is just a prime close
to the number of buckets. In any case, the current bucket code works very
well with the current hash function; both FNV and the DJB hash perform
worse. (Both functions are sketched in the first code fragment below.)

>> The system only needs to be broken once. All that is required is a set
>> of keys (maybe a few thousand) which hash to the same value. After
>> that the entire algorithm is broken (assuming the goal is to poison
>> the bucket code with collisions). Assuming a 32 bit key-space, it
>> should be easy to find at least one set of 10000 strings which share
>> the same hash value.
>>
>> Do I care to find such a set? NO! But this is a triviality for
>> cryptographers.
>
> Remember, cryptographers are mathematicians. They have a definition of
> "trivial" which doesn't match that used by software engineers. I can
> remember a number of acquaintances state that providing examples once an
> existence proof was done was trivial. Can't say that I really bought
> their argument... :-)

The keyspace is only 32 bits; this is trivial by any measure. I guess
Gustav has already found several sets of keys. Of course, the real
problem is that you only need the low-order bits of the hash to match,
so the effective keyspace for the buckets is much smaller than 32 bits.
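For reference, here is a minimal sketch (in C, since that is where these
hash functions live) of the two functions being compared, with bucket
selection done by mod a prime near the table size as discussed above.
The prime 1021 and the sample keys are arbitrary choices for
illustration, not anything taken from the Tcl sources:

#include <stdio.h>

/* Classic Tcl string hash: h = h*9 + c for each character. */
static unsigned int
classic_hash(const char *s)
{
    unsigned int h = 0;
    while (*s) {
        h += (h << 3) + (unsigned char) *s++;
    }
    return h;
}

/* 32-bit FNV-1a. */
static unsigned int
fnv1a_hash(const char *s)
{
    unsigned int h = 2166136261u;
    while (*s) {
        h ^= (unsigned char) *s++;
        h *= 16777619u;
    }
    return h;
}

int
main(void)
{
    const unsigned int nearPrime = 1021;  /* prime near 1024 buckets */
    const char *keys[] = { "Content-Type", "Content-Length", "Host" };
    int i;

    for (i = 0; i < 3; i++) {
        unsigned int c = classic_hash(keys[i]);
        unsigned int f = fnv1a_hash(keys[i]);
        printf("%-15s classic=%10u bucket=%4u  fnv1a=%10u bucket=%4u\n",
               keys[i], c, c % nearPrime, f, f % nearPrime);
    }
    return 0;
}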
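And a rough illustration of how small that effective keyspace is: a
brute-force loop over four-letter lowercase strings that collects keys
landing in the same bucket as "aaaa" in a hypothetical 1024-bucket table,
i.e. keys whose classic hashes agree in their low 10 bits. The mask and
the four-letter alphabet are arbitrary, chosen only to keep the search
tiny:

#include <stdio.h>

/* Same classic Tcl hash as in the previous sketch. */
static unsigned int
classic_hash(const char *s)
{
    unsigned int h = 0;
    while (*s) {
        h += (h << 3) + (unsigned char) *s++;
    }
    return h;
}

int
main(void)
{
    unsigned int target = classic_hash("aaaa") & 0x3ff;
    unsigned int n, found = 0;

    for (n = 0; n < 26u * 26u * 26u * 26u && found < 10; n++) {
        char key[5];
        unsigned int i, m = n;

        for (i = 0; i < 4; i++) {      /* decode n as a base-26 string */
            key[i] = (char) ('a' + m % 26);
            m /= 26;
        }
        key[4] = '\0';
        if ((classic_hash(key) & 0x3ff) == target) {
            printf("%s\n", key);       /* same low bits, same bucket */
            found++;
        }
    }
    return 0;
}

With only 10 bits deciding placement, about one candidate in a thousand
lands in the chosen bucket, so matches turn up almost immediately.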
> In terms of hash attacks, they only have real fangs (given that we don't
> conflate hashcode equivalence with equality) if it is possible to make
> either:
>
> a) A large number of *short* keys with the same hashcode, or
> b) A substantial number of keys with the same hashcode as some
>    particular key from a set not nominated by the attacker (typically
>    defined by the communication protocol)
>
> The aim of the first one is to swamp the hash table with totally
> irrelevant things. The aim of the second is to retard required lookups.

Case b) fits the typical definition of a collision. Of course, collisions
have nothing to do with cryptography here, since the key is always
compared anyway. Hopefully the key comparison stops on the first
mismatch, so the actual number of bits compared is related more to the
length of the common prefix shared by the keys in one bucket. But since
(again) the key is always compared, other kinds of attack would be just
as effective at consuming resources: a few very long keys which hash to
the same value, or which land in the same bucket. The communication
protocol itself will also consume resources transmitting the data, and
the hashing and storage of large quantities of data is an attack in its
own right; in fact, that attack becomes more effective the more expensive
the hash function is. Like I said earlier: the utility of cryptography
lies in the ability to transfer the expensive calculations to the
attacker. If instead you provide a platform on which the attacker can
make you perform many expensive calculations, the attacker has already
succeeded without lifting a finger.

It is much better to solve this problem at the application level: limit
the number of headers, or the length of an entire request. Otherwise the
attacker can just send millions of headers instead of thousands; what
difference does it make how performance is degraded? DOS attacks are not
sophisticated attacks. Maybe just send a very long URL and see if the
whole URL gets written to disk in the log file. Eventually the disk
fills up.

>> I know there is a huge bias against AOLserver, so I digress to Apache.
>> How does Apache handle headers?
>>
>> From http://thomas.eibner.dk/apache/table.html (Apache table API):
>>
>> "Tables in Apache works almost like hashes in Perl. The main
>> difference between the two is that you can have two of the same keys
>> in the Apache table API. The table lookups are done in a
>> case-insensitive manner."
>>
>> So for everyone who has some problem with AOLserver, maybe note that
>> headers in Apache are handled very similarly. More specifically, they
>> are not handled as associative arrays or dictionaries.
>
> Checking the source to Apache (strictly the APR) I see that they're
> using hashing inside their implementation of tables. OK, they're using a
> truly trivial hash, but I suppose it's better than what they had before
> (a single linear list).
>
> Check for yourself at:
> http://svn.apache.org/repos/asf/apr/apr/trunk/tables/apr_tables.c

Note that Apache does not modify the key, can store multiple copies of
the same key, and can search for keys in a case-insensitive manner.
Unfortunately, the case-insensitive search is limited to ASCII
characters. Headers are also stored in a linear array (though it looks
like you can actually only store two keys with the same name). The only
way to justify using an array instead of a linear list of the form
{{key value} {key value}} is if you don't think it is important to
maintain the data in the form it was sent. That is a valid argument. But
note that in order to add a key, or lappend to an existing key, you have
to search the array first; a large array will consume more resources
during creation than a list containing the same data. A rough sketch of
that difference follows below.
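To make the creation-cost point concrete, here is a miniature sketch
(again in C, and emphatically not APR's actual code) of a table-style
insert that must search the flat array before adding, next to a
list-style insert that just appends the {key value} pair. The header
names are invented:

#include <stdio.h>
#include <string.h>

struct entry { const char *key; const char *val; };

/* Table style: linear search before every insert, so building n
 * entries costs O(n^2) comparisons overall. */
static int
table_set(struct entry *t, int n, const char *key, const char *val)
{
    int i;
    for (i = 0; i < n; i++) {
        if (strcmp(t[i].key, key) == 0) {
            t[i].val = val;            /* existing key: overwrite */
            return n;
        }
    }
    t[n].key = key;                    /* new key: append */
    t[n].val = val;
    return n + 1;
}

/* List style: append unconditionally, O(1) per pair, and the data
 * keeps the exact form in which it was sent. */
static int
list_append(struct entry *l, int n, const char *key, const char *val)
{
    l[n].key = key;
    l[n].val = val;
    return n + 1;
}

int
main(void)
{
    struct entry table[4], list[4];
    int tn = 0, ln = 0;

    tn = table_set(table, tn, "Host", "example.com");
    tn = table_set(table, tn, "Accept", "*/*");
    tn = table_set(table, tn, "Accept", "text/html");  /* repeated key */

    ln = list_append(list, ln, "Host", "example.com");
    ln = list_append(list, ln, "Accept", "*/*");
    ln = list_append(list, ln, "Accept", "text/html");

    printf("table holds %d entries, list holds %d\n", tn, ln);
    return 0;
}

The table ends up with two entries (the repeated key was found and
overwritten, at the cost of the search), while the list keeps all three
pairs in the order they arrived and pays nothing until lookup time.

tom jackson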