Thread: [Linuxtuples-devel] Hashtable on the tuple server?
Brought to you by:
wware
From: Will W. <ww...@al...> - 2006-09-15 13:12:26
|
Does a hashtable on the server make sense? The idea would be to generate hashes for the first three elments of the tuple, which would index into a cubical table of linked lists of stored tuples. When reading or getting, depending on the number and placement of wildcards, the cube would be searched at a point, or a line segment, or a square. There would need to be separate hashtables for tuples shorter than three elements. I chose three only because I can easily envision a cube, and envisioning a tesseract takes more work. For very large numbers of tuples, and searches with few wildcards, this would make reading and getting much more efficient. It would be a fair amount of work to implement, but the payoff might justify it. The question of whether the hashtable should be linear, square, cubical, etc. would depend on the average or expected number of wildcards in the first few elements of a template, and there would be a tradeoff with hashtable memory size on the server. If the first two elements are rarely wildcards, but the third is usually a wildcard, then a cube offers no advantage over a square. Bram, AFAIK you are currently the heaviest user of LinuxTuples on the planet. Do you have a sense for the distribution of wildcards in the first few elements? Would that be an easy statistic to collect while your system is running? Any thoughts on how many bits a hash value should have? Will |
From: Will W. <ww...@al...> - 2006-09-15 13:19:51
|
Another potential benefit of a hashtable occurs to me. Currently we have one mutex that locks the entire tuple space. With a hashtable, we could lock only portions of the tuple space, leaving the rest available for concurrent accesses. This might also help performance. Will |
From: Bram S. <br...@sa...> - 2006-09-15 13:55:52
|
Will Ware wrote: > Does a hashtable on the server make sense? The idea would be to > generate hashes for the first three elments of the tuple, which would > index into a cubical table of linked lists of stored tuples. When > reading or getting, depending on the number and placement of > wildcards, the cube would be searched at a point, or a line segment, > or a square. There would need to be separate hashtables for tuples > shorter than three elements. I chose three only because I can easily > envision a cube, and envisioning a tesseract takes more work. > > For very large numbers of tuples, and searches with few wildcards, > this would make reading and getting much more efficient. It would be a > fair amount of work to implement, but the payoff might justify it. In theory, yes, hashtables give you O(1) lookup. We now have O(n) lookup. I don't think that a 3D map would be req'd. Just hash all elts to a single hash value, and use STL's std::hash_map<> In practice, you have to be very careful... Only for very large N, does hashing make sense. std::map<> takes e.g. O(logN), and is likely to be just as fast. For me, N is low enough not to notice perf gains with binary or hash lookups. Before you undertake a major effort, I suggest you do a small test on std::list, std::map and std::hash_map, to see at what N you will reap performance benefits. STL has built in hashing funcs for strings, and probably floats and ints as well, so they match your linuxtuples primitives. You would need to do the hashing recursively for nested tuples, else your hashing distribution goes wrong. My gut feeling is that it's not worth the effort. But then again, I don't know what you use the tspace for. If you have 100M tuples in it, I would pursue it. Bram > > The question of whether the hashtable should be linear, square, > cubical, etc. would depend on the average or expected number of > wildcards in the first few elements of a template, and there would be > a tradeoff with hashtable memory size on the server. If the first two > elements are rarely wildcards, but the third is usually a wildcard, > then a cube offers no advantage over a square. > > Bram, AFAIK you are currently the heaviest user of LinuxTuples on the > planet. Do you have a sense for the distribution of wildcards in the > first few elements? Would that be an easy statistic to collect while > your system is running? Any thoughts on how many bits a hash value > should have? > > Will > > ------------------------------------------------------------------------- > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job easier > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo > http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642 > _______________________________________________ > Linuxtuples-devel mailing list > Lin...@li... > https://lists.sourceforge.net/lists/listinfo/linuxtuples-devel |
From: Will W. <ww...@al...> - 2006-09-15 15:21:07
|
On 9/15/06, Bram Stolk <br...@sa...> wrote: > For me, N is low enough not to notice perf gains with binary > or hash lookups. > My gut feeling is that it's not worth the effort. But then again, > I don't know what you use the tspace for. I'm not using it for anything at the moment, I just had the hashing idea in the back of my mind and thought it might be time to think about it again. I'll add a note about it as a possibility but I won't do anything with it. As far as using STL classes, I'd prefer to stay with pure C. Moving to C++ might discourage some potential users or limit its portability. Maybe I'm just being old-fashioned and should allow myself to be dragged into the 1990s. Will |
From: Will W. <ww...@al...> - 2006-09-15 14:48:11
|
On 9/15/06, Robert Belleman <ro...@sc...> wrote: > A hashtable definitely makes sense. > We had one > application here where we exchange tuples that contain > megabytes of data in one of the first three elements > And we don't want to force users into adopting certain > style habits in the structure of their tuples. > As an alternative, you could compute a hash-key from > the first N bytes of the first elements in a tuple > (perhaps md5sum?) and use lists for tuples that hash- > clash. http://en.wikipedia.org/wiki/Hash_table Using a list for the clashes is a common practice, and necessary if you can't guarantee clashes will never occur. So you always have an O(N) subsearch at the end, but it can be a lot smaller if your hashtable is clever. (And as I mentioned earlier, you only lock a portion of the tuple space, which helps concurrent requests.) For small hashes, the hashtable can simply be an array in memory with a lookup time of O(1). For larger hashes it needs some kind of memory-efficient sparse representation, and the time to search it will go up. But it's still faster than making the whole tuple space one big linked list, as we have it now. I agree that we don't want to enforce style rules on tuples. If we must, we should let users choose their own rules, e.g. tag their tuples with a user-specified hashing policy. It turns out MD5 isn't great for a hashing algorithm. It does a lot of work to ensure the hash is cryptographically secure; hashtables for searching don't require anything like that. The Wikipedia article talks about choice of hash function and suggests some better options. Google has a hash library that might be relevant, though it's in C++ rather than C. http://goog-sparsehash.sourceforge.net/ Hashing only the first N-or-fewer bytes of an element is definitely a good idea. Assuming we decide to tag tuples with hashing policies, we can get any of these requests for searches: * A search with the same hashing policy, the best case, all benefits of hashing are in my favor. * A search with a different hashing policy. I should try searching by its policy first, because that search is fast and relevant, and it locks only a portion of the tuple space. If that is unsuccessful, I need to search the entire space. * A search with no specified hashing policy. Try each of the existing policies first (since as before, hashed searches are faster and less invasive) and then search the entire space. We could avoid these two last ugly cases by declaring that each hashing policy represents a tuple space disjoint from all other tuple spaces on the same server. There would still be an unhashed policy-less space but we would encourage users to specify policies for better performance. Disjoint spaces isn't a bad idea. Chances are that if you're putting out a tuple, you have a general idea who the intended recipient is, or at least you need to specify the tuple's format in general terms, and it's not much additional effort to include a hashing policy in that specification. The simplest hash policy is to say that a tuple should be hashed only on a single element, and specify which element, e.g. the second. The next simplest policy I can imagine would be to use bits in a 32-bit word to select which of the first 32 elements should be hashed.This would require a sparse-representation hash table, since a full table would be way too huge. Will |