Hi. I have some code that builds an index over a potentially large data set. Currently this is implemented as hashtables. Some entries are 1:1, some are 1:N. For instance, one map has around five entries, each containing a collection of around 100,000 values. I tend to use Google's Multimap implementation for this.
It has got to the point that it is too large for memory. I want to store the index on disk, suitably cached. Do you think JDBM's HTree implementation would be a suitable choice? My worry is that put(key, value) would mean a get, an add, then a put, which, on a 100,000-element collection, would be wasteful. Has anyone had any experience doing this, or could anyone suggest a better data structure?
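For reference, the in-memory 1:N pattern described above can be sketched with plain java.util types (the Guava Multimap mentioned behaves similarly; this sketch just avoids the dependency, and the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryIndex {
    // 1:N index: each key maps to a (potentially very large) collection of values.
    private final Map<String, List<String>> index = new HashMap<>();

    public void put(String key, String value) {
        index.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public List<String> get(String key) {
        return index.getOrDefault(key, List.of());
    }
}
```

With around five keys and ~100,000 values each, it is these per-key lists that dominate the heap, which is what pushes the index out of memory.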
The way I handle this sort of thing is with a B+Tree that uses Key+Value as the key.
1234:ABCD -> Null Object
1234:DEFG -> Null Object
So I don't use the values at all. This gives me multi-map capability (I can browse the tree starting at 1234: and ending at 1234:zzzzzz, or some other suitable maximum value).
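The composite-key trick can be illustrated with the JDK's sorted map standing in for the B+Tree (the JDBM API itself differs; this is just the idea): insert "key:value" with a null-object payload, then range-scan from "key:" up to a high sentinel.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class CompositeKeyMultimap {
    // Sorted map standing in for the B+Tree; keys are "key:value", payloads are ignored.
    private final TreeMap<String, Boolean> tree = new TreeMap<>();

    public void put(String key, String value) {
        tree.put(key + ":" + value, Boolean.TRUE); // null-object payload
    }

    // Browse everything from "key:" up to "key:" plus a suitable maximum sentinel.
    public SortedMap<String, Boolean> browse(String key) {
        return tree.subMap(key + ":", key + ":\uffff");
    }
}
```

Each put is a single O(log n) insert, so adding one value never requires reading or rewriting the other 100,000.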
I don't know that this would translate to HTree, though.
One option would be to store keys in the HTree, with values that are a pointer to a different data structure that contains your list. BTree would probably be appropriate for your list in this case.
Personally, I would go with the BTree-only approach first and see if there are performance problems - it is by far the simplest approach.
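The two-level option can be sketched with stdlib analogues (the names here are made up): a hash map plays the HTree, and a per-key sorted set plays the secondary BTree. In JDBM the top-level value would be a recid (a Java long locating the secondary tree on disk) rather than an object reference.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class TwoLevelIndex {
    // Hash index (HTree analogue) from key to a per-key sorted structure (BTree analogue).
    // In JDBM the map value would instead be a recid pointing at the secondary tree.
    private final Map<String, TreeSet<String>> top = new HashMap<>();

    public void put(String key, String value) {
        // One hash lookup, then an O(log n) insert into the per-key tree:
        // no get-whole-collection, add, put-whole-collection cycle.
        top.computeIfAbsent(key, k -> new TreeSet<>()).add(value);
    }

    public TreeSet<String> values(String key) {
        return top.getOrDefault(key, new TreeSet<>());
    }
}
```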
Thanks for the suggestion. I implemented this over the past couple of nights and it worked nicely.
This tool is looking very promising.
I'm now going to ramble on a few things to see if I can get the best possible use out of JDBM. If you have any input, do let me know.
I started this because the index I am building got too big for memory. So I wanted to persist to disk somehow, but I didn't want the burden of a SQL store. My concerns were: performance hit, actual memory saved, disk space used. The first two, from my initial stress tests, seem fine. Disk usage, at about 180MB with transaction log switched off, is something I would like to improve on a little. Here are my ideas:
The multipart keys you suggested are very long. For instance, their structure currently looks like this:
(Obviously this is representative data - my actual data is far longer, probably on the order of seventy characters per full multipart key).
My first inclination is to normalise. I could have HTrees for teams, positions and players. E.g:
england -> A
forward -> A
midfielder -> B
So the keys become:
This would require (number of key parts) lookups before the full lookup. It should save space in principle, though I'm not sure whether JDBM already does anything similar itself.
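A minimal sketch of the normalisation idea, assuming a dictionary that assigns short codes to key parts as they are first seen (the post suggests one HTree per part type - teams, positions, players - whereas this sketch uses a single shared in-memory dictionary for brevity, and the code scheme is made up):

```java
import java.util.HashMap;
import java.util.Map;

public class KeyDictionary {
    // Maps long key parts (e.g. "england", "forward") to short codes ("A", "B", ...).
    private final Map<String, String> codes = new HashMap<>();
    private int next = 0;

    public String code(String part) {
        return codes.computeIfAbsent(part, p -> base26(next++));
    }

    // Encode a multipart key as its short codes joined with ':'.
    public String encode(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            if (sb.length() > 0) sb.append(':');
            sb.append(code(p));
        }
        return sb.toString();
    }

    // 0 -> "A", 1 -> "B", ..., 26 -> "BA", etc.
    private static String base26(int n) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.insert(0, (char) ('A' + (n % 26)));
            n /= 26;
        } while (n > 0);
        return sb.toString();
    }
}
```

A seventy-character multipart key shrinks to a few characters per part, at the cost of one dictionary lookup per part before the full lookup.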
Also, I noticed mention of a 'compressor' in another thread, but I can't see any sign of it in the version I downloaded.
If you have any thoughts I'd be very interested.
I suggest you try Kevin's other suggestion then. Create an HTree with your top-level keys, whose values are pointers (recids; Java longs) to BTrees containing the multiple values associated with each key. This should help with space consumption and keep your lookups fairly efficient.
The key compressor works very well in these situations - key compression has been in the code base for quite some time. Here is code I use for constructing my indexes:
tree = BTree.createInstance(recman, keyComparator, keySerializer, EmptyValueSerializer.INSTANCE);
Thanks. I might be using the wrong version, but setKeyCompressionProvider is definitely not on the BTree API I have. I'm using the src distribution downloaded from http://jdbm.sourceforge.net/
Check out and build from here:
I don't know if there is an actual distro (pre-built jar file) or not, but the above will get you what you are asking for.