kent fitch - 2002-06-17

I am motivated by an application for which I'm considering JDBM, in which the keys will be quite long Strings and most of them will share many leading characters with their neighbours.

The externalization routines of BPage make it very easy to implement leading String compression during serialisation and deserialisation of the BPage.  It might be nicer still to keep the page in the cache with compressed keys and perform all the BTree operations on them, but that was too complicated for a simple experiment.

The approach I took was, on serialisation, to check whether the first key was a String object.  If it was, to find the number of common leading characters between the first and last keys, and if this was greater than a threshold, to insert 2 new objects into the serialisation stream: a Boolean (big B) with the value true, and a String containing the common string.  The common string was then removed from each serialised key (but the keys in the BPage key array were not changed).  Null keys were checked and specially handled.
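
In outline, the serialisation side looks something like the sketch below.  This is a simplified illustration, not the actual patch I made; the class and method names and the threshold value are just stand-ins.

    import java.io.IOException;
    import java.io.ObjectOutput;

    // Sketch of leading String compression on BPage serialisation.
    public class LeadingKeyWriter {

        // Assumed minimum common-prefix length worth compressing.
        static final int THRESHOLD = 3;

        // Number of leading characters shared by the first and last keys.
        static int commonPrefixLength(String first, String last) {
            int max = Math.min(first.length(), last.length());
            int i = 0;
            while (i < max && first.charAt(i) == last.charAt(i)) {
                i++;
            }
            return i;
        }

        static void writeKeys(ObjectOutput out, Object[] keys) throws IOException {
            String prefix = null;
            Object first = keys[0];
            Object last = keys[keys.length - 1];
            if (first instanceof String && last instanceof String) {
                int n = commonPrefixLength((String) first, (String) last);
                if (n > THRESHOLD) {
                    prefix = ((String) first).substring(0, n);
                }
            }
            if (prefix != null) {
                // Marker: a Boolean (big B) flags that the common string follows.
                out.writeObject(Boolean.TRUE);
                out.writeObject(prefix);
            }
            for (int i = 0; i < keys.length; i++) {
                Object key = keys[i];
                if (prefix != null && key instanceof String) {
                    // Strip the common string from the serialised key only;
                    // the key in the BPage key array is not changed.
                    out.writeObject(((String) key).substring(prefix.length()));
                } else {
                    // Null keys (and any non-String keys) pass through unchanged.
                    out.writeObject(key);
                }
            }
        }
    }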

On deserialisation, when about to read the serialised keys, the first object is read, and if it is of type Boolean (big B), the common string is assumed to follow.  This assumption is reasonable because it is extremely unlikely that a Boolean object would ever be used as a key.  However, it is also very kludgey, and as discussed below, if this logic were added to JDBM it would be much better to change the serialisation stream to include a boolean (little b) indicating whether leading characters have been identified and removed from the keys.  My JVM (Sun 1.3, WinNT) seems to add 65 bytes to the serialisation stream for the first Boolean, which is a big hit and makes the threshold for compression quite high.
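
The matching deserialisation side might look like this (again just an illustrative sketch with made-up names; a proper version would read the little-b boolean flag rather than sniffing for the Boolean marker):

    import java.io.IOException;
    import java.io.ObjectInput;

    // Sketch of restoring the common leading string on BPage deserialisation.
    public class LeadingKeyReader {

        static Object[] readKeys(ObjectInput in, int count)
                throws IOException, ClassNotFoundException {
            Object[] keys = new Object[count];
            Object first = in.readObject();
            String prefix = null;
            if (first instanceof Boolean) {
                // The Boolean marker means the common string follows,
                // and only then the keys themselves.
                prefix = (String) in.readObject();
                first = in.readObject();
            }
            keys[0] = restore(prefix, first);
            for (int i = 1; i < count; i++) {
                keys[i] = restore(prefix, in.readObject());
            }
            return keys;
        }

        private static Object restore(String prefix, Object key) {
            if (prefix != null && key instanceof String) {
                return prefix + (String) key;  // add the common string back
            }
            return key;  // null keys were stored as-is
        }
    }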

Anyway, to test this I took a list of 120,000 titles of poems and other works from a literary database and sorted them in ASCII order to form the JDBM keys.  The JDBM value was the character "X" concatenated with the key value.  The average key length was 26.5 characters (and the average value length was one more!).
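
The data preparation amounted to something like this (the titles here are just stand-ins for the 120,000 real ones):

    import java.util.Arrays;

    // Sketch of the test data: titles sorted in ASCII order form the keys;
    // each value is "X" concatenated with its key.
    public class TestData {
        public static void main(String[] args) {
            String[] titles = {
                "The Man from Snowy River",
                "My Country",
                "The Drover's Wife"
            };
            Arrays.sort(titles);  // String ordering is ASCII order here
            for (int i = 0; i < titles.length; i++) {
                String key = titles[i];
                String value = "X" + key;
                System.out.println(key + " -> " + value);
            }
        }
    }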

Without compression, using the existing serialisation logic with an MRU cache of 100, a pagesize of 64, and a pagesplit of 2 (a new feature to reduce page splits when inserting monotonic keys), on a Celeron 500MHz NT 4 machine with Sun JVM 1.3, the insertion of 120,000 records took 17.6 secs.  Commits were made every 100 records.  A sequential browse took 3.5 secs, and 120,000 random reads took 42.6 secs.  The database size was 7.584MB (compared to a raw data size of about 6.7MB), and it contained 33 non-leaf and 1936 leaf pages.

With compression and the same parameters and input, the insertion took 19.1 secs.  A sequential browse took 4.5 secs, and 120,000 random reads took 48.2 secs.  The database size was 7.248MB, with the number of leaf and non-leaf pages the same as in the uncompressed run (as you'd hope!).  A total of 7564 BPage serialisations took place, about half of which produced leading key compression, with an average of 4.2 leading characters removed per key on those compressed pages.

Not surprisingly, removing the common leading part of the key on serialisation and adding it back on deserialisation consumes CPU time, I suspect primarily via the extra String object created for each key and hence the added garbage collector burden.  On this sample data the space savings are pretty minor: 7.584MB - 7.248MB = 0.336MB, which is about 5% of the file overall and roughly 10% of the ~3.2MB of key data (120,000 keys x 26.5 characters).  So unless you had many keys with many common characters, such a change is not likely to be significant.  But there could be cases (such as my proposed use) where the improvements could be quite major.

This change would also only be of use to JDBM users with String keys, and I suspect it would be best made a BTree creation parameter which could be specified by users who knew they'd get some benefit.

I'll email the changes to Alex just in case he or anyone else is interested.

Kent Fitch
AustLit project