Re: [Linuxtuples-devel] Hashtable on the tuple server?
Brought to you by:
wware
From: Will W. <ww...@al...> - 2006-09-15 14:48:11
|
On 9/15/06, Robert Belleman <ro...@sc...> wrote: > A hashtable definitely makes sense. > We had one > application here where we exchange tuples that contain > megabytes of data in one of the first three elements > And we don't want to force users into adopting certain > style habits in the structure of their tuples. > As an alternative, you could compute a hash-key from > the first N bytes of the first elements in a tuple > (perhaps md5sum?) and use lists for tuples that hash- > clash. http://en.wikipedia.org/wiki/Hash_table Using a list for the clashes is a common practice, and necessary if you can't guarantee clashes will never occur. So you always have an O(N) subsearch at the end, but it can be a lot smaller if your hashtable is clever. (And as I mentioned earlier, you only lock a portion of the tuple space, which helps concurrent requests.) For small hashes, the hashtable can simply be an array in memory with a lookup time of O(1). For larger hashes it needs some kind of memory-efficient sparse representation, and the time to search it will go up. But it's still faster than making the whole tuple space one big linked list, as we have it now. I agree that we don't want to enforce style rules on tuples. If we must, we should let users choose their own rules, e.g. tag their tuples with a user-specified hashing policy. It turns out MD5 isn't great for a hashing algorithm. It does a lot of work to ensure the hash is cryptographically secure; hashtables for searching don't require anything like that. The Wikipedia article talks about choice of hash function and suggests some better options. Google has a hash library that might be relevant, though it's in C++ rather than C. http://goog-sparsehash.sourceforge.net/ Hashing only the first N-or-fewer bytes of an element is definitely a good idea. Assuming we decide to tag tuples with hashing policies, we can get any of these requests for searches: * A search with the same hashing policy, the best case, all benefits of hashing are in my favor. * A search with a different hashing policy. I should try searching by its policy first, because that search is fast and relevant, and it locks only a portion of the tuple space. If that is unsuccessful, I need to search the entire space. * A search with no specified hashing policy. Try each of the existing policies first (since as before, hashed searches are faster and less invasive) and then search the entire space. We could avoid these two last ugly cases by declaring that each hashing policy represents a tuple space disjoint from all other tuple spaces on the same server. There would still be an unhashed policy-less space but we would encourage users to specify policies for better performance. Disjoint spaces isn't a bad idea. Chances are that if you're putting out a tuple, you have a general idea who the intended recipient is, or at least you need to specify the tuple's format in general terms, and it's not much additional effort to include a hashing policy in that specification. The simplest hash policy is to say that a tuple should be hashed only on a single element, and specify which element, e.g. the second. The next simplest policy I can imagine would be to use bits in a 32-bit word to select which of the first 32 elements should be hashed.This would require a sparse-representation hash table, since a full table would be way too huge. Will |