A while back, Kent Fitch suggested a concept for key compression in the BTree. He put together a proof of concept that only worked with string keys.
I did not want to use Kent's code "as is" for a couple of reasons: it lacked support for generalized byte arrays, and the mechanism it used for determining whether compression was active - storing a Byte in the object stream - was space inefficient. I'm hoping Kent will provide feedback on what I've implemented.
I have taken Kent's ideas and extrapolated a bit to develop a generalized leading value key compression algorithm. The algorithm works on byte arrays, so it should work for any type of key that uses a custom serializer.
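To illustrate the core idea (names here are my own, not from the patch): consecutive keys in a sorted page tend to share a leading run of bytes, so each key only needs to record how long that shared prefix is, plus the bytes that differ. A minimal sketch of the prefix computation:

```java
// Illustrative sketch of the leading-value idea; not the actual patch code.
public class LeadingValueDemo {
    // Number of leading bytes shared by two byte-array keys.
    static int commonPrefixLength(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) {
            i++;
        }
        return i;
    }
}
```

For example, the keys "/usr/bin/jar" and "/usr/bin/java" share an 11-byte prefix ("/usr/bin/ja"), so the second key can be stored in just a couple of bytes beyond the length header. Because this operates on raw byte arrays rather than Strings, it works for any key type with a custom serializer.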
The biggest question I have now is the best approach, architecturally, to add this to the jdbm code.
The architecture that I currently have implemented has the following classes/interfaces:
User can call setKeyCompressionProvider(CompressionProvider) on the Btree when it is created.
The compression provider is used to obtain new, initialized ByteArrayCompressor and ByteArrayDecompressor objects. We can't use the same compressor/decompressor for all BTrees, because these maintain internal state.
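A minimal sketch of what these interfaces might look like, including a trivial pass-through provider for context (all method names and signatures here are my assumptions about the design, not necessarily what the patch actually declares):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical interface shapes; the actual patch may differ.
interface ByteArrayCompressor {
    // Compresses one serialized key into the page's output stream.
    // Stateful: may remember previous keys, so a fresh instance is
    // needed for each serialization pass.
    void compressNextGroup(byte[] data, DataOutput out) throws IOException;
}

interface ByteArrayDecompressor {
    // Reverses compressNextGroup; keys must be read back in the same order.
    byte[] decompressNextGroup(DataInput in) throws IOException;
}

interface CompressionProvider {
    // Fresh, independently-stateful objects for each BPage read/write.
    ByteArrayCompressor newCompressor();
    ByteArrayDecompressor newDecompressor();
}

// A trivial pass-through provider: length prefix plus raw bytes,
// i.e. no actual compression.
class PassThroughProvider implements CompressionProvider {
    public ByteArrayCompressor newCompressor() {
        return (data, out) -> {
            out.writeInt(data.length);
            out.write(data);
        };
    }
    public ByteArrayDecompressor newDecompressor() {
        return in -> {
            byte[] b = new byte[in.readInt()];
            in.readFully(b);
            return b;
        };
    }
}
```

Handing out fresh instances from the provider is what makes the per-BTree state safe: two pages being serialized concurrently never share a compressor.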
The BPage's serialization mechanism obtains a ByteArrayCompressor from the CompressionProvider associated with the BTree, and sends the results of each key's serialization into it. The compressor, in turn, writes the compressed byte stream to the DataOutput that the BPage is using for its serialization.
CompressionProvider can be implemented with different compression strategies. So far, I've developed a DefaultCompressionProvider that just writes data out without compression (i.e. it does exactly what the BPage's standard serialization does), and a LeadingValueCompressionProvider that remembers the previous value of a given key and only stores the data that differs between the current key and the previous one.
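A sketch of how the leading-value strategy might encode and decode keys - each key written as (shared-prefix length, suffix length, suffix bytes), with the previous key kept as internal state. Class and method names are illustrative, not the actual patch:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch of the leading-value codec; use one instance for
// writing a page and a separate instance for reading it back, since the
// "previous key" state advances with every call.
class LeadingValueCodec {
    private byte[] previous = new byte[0];

    // Compressor side: emit only the bytes that differ from the previous key.
    void write(byte[] key, DataOutput out) throws IOException {
        int shared = 0;
        int max = Math.min(previous.length, key.length);
        while (shared < max && previous[shared] == key[shared]) {
            shared++;
        }
        out.writeInt(shared);
        out.writeInt(key.length - shared);
        out.write(key, shared, key.length - shared);
        previous = key;
    }

    // Decompressor side: rebuild the key from the previous one plus the suffix.
    byte[] read(DataInput in) throws IOException {
        int shared = in.readInt();
        int suffixLen = in.readInt();
        byte[] key = new byte[shared + suffixLen];
        System.arraycopy(previous, 0, key, 0, shared);
        in.readFully(key, shared, suffixLen);
        previous = key;
        return key;
    }
}
```

Note that this degrades gracefully: if two adjacent keys share nothing, the shared length is simply zero and the full key is written, which is why a separate DefaultCompressionProvider still pays off only by skipping the two length headers.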
I *think* that this architecture makes sense, but it does add a number of interfaces and classes... am I making this more complicated than it needs to be?
Anybody have any suggestions or questions about the design?
I'll upload a patch with the current design if anyone wants to give it a spin... (Alex - this is a big enough change that I don't want to commit it without a bit of discussion).
In terms of performance, I put together a quick test that reads the filenames from my file system, then inserts them, along with the length of each file, into a tree with and without compression. I'm seeing a 50% reduction in total file size, and performance is the same (either the extra processing added by the compression is made up for by writing less data to disk, or - more likely - overall performance is dominated by blocking on disk access, so the CPU overhead just doesn't matter).
This works for me.