Re: [Linuxtuples-devel] Hashtable on the tuple server?

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

On 9/15/06, Robert Belleman <ro...@sc...> wrote:
> A hashtable definitely makes sense.
> We had one
> application here where we exchange tuples that contain
> megabytes of data in one of the first three elements
> And we don't want to force users into adopting certain
> style habits in the structure of their tuples.
> As an alternative, you could compute a hash-key from
> the first N bytes of the first elements in a tuple
> (perhaps md5sum?) and use lists for tuples that hash-
> clash.

http://en.wikipedia.org/wiki/Hash_table

Using a list for the clashes is a common practice, and necessary if
you can't guarantee clashes will never occur. So you always have an
O(N) subsearch at the end, but it can be a lot smaller if your
hashtable is clever. (And as I mentioned earlier, you only lock a
portion of the tuple space, which helps concurrent requests.)

For small hashes, the hashtable can simply be an array in memory with
a lookup time of O(1). For larger hashes it needs some kind of
memory-efficient sparse representation, and the time to search it will
go up. But it's still faster than making the whole tuple space one big
linked list, as we have it now.

I agree that we don't want to enforce style rules on tuples. If we
must, we should let users choose their own rules, e.g. tag their
tuples with a user-specified hashing policy.

It turns out MD5 isn't great for a hashing algorithm. It does a lot of
work to ensure the hash is cryptographically secure; hashtables for
searching don't require anything like that. The Wikipedia article
talks about choice of hash function and suggests some better options.
Google has a hash library that might be relevant, though it's in C++
rather than C.
http://goog-sparsehash.sourceforge.net/

Hashing only the first N-or-fewer bytes of an element is definitely a good idea.

Assuming we decide to tag tuples with hashing policies, we can get any
of these requests for searches:
    * A search with the same hashing policy, the best case, all
benefits of hashing
    are in my favor.
    * A search with a different hashing policy. I should try searching
by its policy
    first, because that search is fast and relevant, and it locks only
a portion of the
    tuple space. If that is unsuccessful, I need to search the entire space.
    * A search with no specified hashing policy. Try each of the
existing policies
    first (since as before, hashed searches are faster and less
invasive) and then
    search the entire space.
We could avoid these two last ugly cases by declaring that each
hashing policy represents a tuple space disjoint from all other tuple
spaces on the same server. There would still be an unhashed
policy-less space but we would encourage users to specify policies for
better performance.

Disjoint spaces isn't a bad idea. Chances are that if you're putting
out a tuple, you have a general idea who the intended recipient is, or
at least you need to specify the tuple's format in general terms, and
it's not much additional effort to include a hashing policy in that
specification.

The simplest hash policy is to say that a tuple should be hashed only
on a single element, and specify which element, e.g. the second.

The next simplest policy I can imagine would be to use bits in a
32-bit word to select which of the first 32 elements should be
hashed.This would require a sparse-representation hash table, since a
full table would be way too huge.

Will