Mod to JDBM to support monotonic key addition

kent fitch
  • kent fitch

    kent fitch - 2002-05-22

    I've been extremely impressed with my testing of JDBM - small, fast, simple!

    One potential use I've been considering is to store info which has a timestamp string as the first part of the key.  The data will be added with the key monotonically increasing (almost always!), and I'm worried that this pattern of key addition will cause index pages to be split such that the "old" half remains half empty, resulting in a greater index depth than necessary.

    From a quick look at the code, the split at the "keys.length/2" point should be possible to change, perhaps with a split point defined at the BTree level, but I'm not sure of all the ramifications.  Has anyone done this yet?  If not, can someone experienced with the code offer any warnings or advice on such a change?

    Kent Fitch
    AustLit project

    • Alex Boisvert

      Alex Boisvert - 2002-05-23

      Hi Kent,

      The BTree implementation in JDBM will always keep the tree balanced, no matter the insertion order.  It might not always be 'perfectly' balanced, but reasonably so (i.e., not enough to be concerned about, IMHO).

      The most important drawback of inserting keys in monotonic order is that you will experience a lot of page splits/merges to rebalance the tree.

      In short, you will pay a little during insertion because of increased I/O to reshuffle BPages around, but lookups should always be close to optimal (as compared to an ideally balanced tree).

      While I think your proposed change should work, I suggest doing your own measurements before changing the code.  (And of course, if you do, I'd be interested in the results!)


    • kent fitch

      kent fitch - 2002-05-24

      Thanks for your reply, Alex.

      I'll do as you suggest - some measurements first!  The first round of tests involved inserting 240,000 records.  On the first run they were added with monotonically increasing keys, and the index size was around 24MB (the keys and values are quite long: 15-30 characters).  On the second run the keys were presented in random order.  The index was only very slightly smaller (22MB), but the run time was much longer (up from about 50 sec to about 300 sec), maybe due to cache effects.  I was using a page size of 64.
      (JDK 1.4, Celeron 500)

      I ran a much smaller test of a few hundred records so that I could reasonably "dump" the index.  When adding in key order, I observed that most of the index pages had 32 of the 64 slots used (if I interpreted the dump output correctly), which suggests that monotonic key insertion really is leaving the index pages half empty.  I'll examine the code to see where the rebalancing and merging happens - I didn't spot it, although I wasn't looking for it, and I'm a bit hazy on how the splitting actually works.
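      A toy simulation (not JDBM code - the page capacity and split behaviour here are simplifying assumptions) shows why ordered inserts with a 50:50 split leave every "old" left half frozen at half capacity:

```java
// Toy model of leaf-page fill under strictly increasing keys.
// Assumes: capacity 64, a 50:50 split on overflow, and that new keys
// always land in the right-most leaf (true for monotonic insertion).
public class OrderedInsertSim {
    public static void main(String[] args) {
        int capacity = 64;
        int rightmost = 0;      // entries in the right-most (active) leaf
        int frozenLeaves = 0;   // left halves that will never fill further
        for (int i = 0; i < 10000; i++) {
            if (rightmost == capacity) {
                frozenLeaves++;             // left page keeps capacity/2 entries
                rightmost = capacity / 2;   // right page starts half full
            }
            rightmost++;
        }
        System.out.println(frozenLeaves + " leaves frozen at " + (capacity / 2)
                + " entries; active leaf holds " + rightmost);
    }
}
```

      For 10,000 ordered inserts this reports 311 leaves stuck at 32 entries each, matching the "32 of 64 slots used" observation above.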

      I'll try to test the proposed changes, measure the results and let you know how it goes.

      I'm also interested in the pros and cons of larger page sizes.


      Kent Fitch
      AustLit Project

      • Alex Boisvert

        Alex Boisvert - 2002-05-24


        I'm surprised by your performance numbers showing random inserts to be slower.  I was expecting monotonic keys to produce higher re-organization costs.  Maybe a cache issue, as you point out.  What cache size are you using?

        The fact that each page is only at half the maximum capacity is a consequence of ordered inserts.  It is, in fact, a degenerate case for the B+Tree.

        About page size, ideally you want to have a BPage fit (but not overflow) a physical page on disk, which is usually 4KB.  That way, one disk read can give you a page and you can narrow down the search to the most appropriate page on the subsequent read.

        I found a BPage size of 64 entries to be a good fit for most purposes -- requiring at most 4 disk reads for indexes containing 1 million records -- assuming keys are relatively small (< 32 characters).
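        That 4-read figure can be sanity-checked with a quick worst-case calculation - a sketch assuming only that the balancing keeps every page at least half full, so the fan-out at each level is at least pageSize/2:

```java
// Worst-case B+Tree height: with pages at least half full, each level
// multiplies the number of reachable entries by at least pageSize/2.
public class TreeDepth {
    static int worstCaseHeight(long records, int pageSize) {
        int minFanout = pageSize / 2;   // pages are at least half full
        int height = 1;
        long reachable = minFanout;     // entries reachable at this height
        while (reachable < records) {
            reachable *= minFanout;
            height++;
        }
        return height;
    }

    public static void main(String[] args) {
        // 1 million records, BPage size 64
        System.out.println(worstCaseHeight(1_000_000L, 64));
    }
}
```

        With a page size of 64 the minimum fan-out is 32, and 32^4 ≈ 1.05 million, so four levels (four disk reads) suffice for a million records even in the worst case.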


    • kent fitch

      kent fitch - 2002-05-29

      Hi Alex,

      I've done a little more digging and I'll include the "benchmark" code I'm using.  The summary of runs inserting 10,000 records into an empty database with a page size of 64:

      insert mode: random
      commit mode: after every insert
      produced index size: 800KB
      elapsed time: 34 secs

      insert mode: random
      commit mode: after every 1000 records
      produced index size: 800KB
      elapsed time: 20 secs

      insert mode: ordered
      commit mode: after every insert
      produced index size: 656KB
      elapsed time: 31 secs

      insert mode: ordered
      commit mode: after every 1000 records
      produced index size: 608KB
      elapsed time: 1.7 secs

      So ordered inserts with infrequent commits are spectacularly faster; otherwise, random vs. ordered makes little difference.  I tried running with cache sizes of 20 and 200, with insignificant differences.

      Here's the core of the code (inserted within your "FamousPeople" source):

      long startTime = new java.util.Date().getTime() ;
      Random rand = new java.util.Random() ;
      boolean randomOrder = true ;
      for (int c=0;c<10000;c++) {
        int r = rand.nextInt(99999) ;
        int chosenKeyPrefix = 1000000 + ((randomOrder) ? r : c) ;
        String key = "" + chosenKeyPrefix + " " + new java.util.Date().getTime() ;
        tree.insert(key, key + " VALUE", false );
        if ((c%1000) == 0) recman.commit();
      }
      recman.commit();  // commit the final partial batch
      System.out.println("done elapsed " + (new Date().getTime() - startTime)) ;

      Different runs changed the randomOrder value and omitted the conditional on the commit.


      Kent Fitch
      AustLit Project

      • Alex Boisvert

        Alex Boisvert - 2002-05-29

        Thanks for the numbers, Kent.

        It looks like profiling and analysis of the BTree are in order to understand what's going on under the covers.  Out of curiosity, I'll be looking into that this week.


    • kent fitch

      kent fitch - 2002-05-29

      I've finally gotten around to understanding the splitting code and benchmarking the effects on performance when adding monotonic keys.  I'm using a program much the same as the previous benchmark program I posted, but adding 20000 records always in increasing key order.

      For small page sizes with this number of records, the effect on BTree height is quite dramatic, as you'd expect, but for something more realistic (a page size of 32) the change is not that great - maybe it would take more records to show up.

      With pagesize of 32 and the current code (ie, split point at "16:16" - half way through):
      btree height: 4
      # of leaf nodes:768
      # of non-leaf nodes: 52
      index size: 1.26MB
      # of splits performed: 1329

      Changing the split point to "31:1" :
      btree height: 3
      # of leaf nodes:625
      # of non-leaf nodes: 21
      index size: 1.18MB
      # of splits performed: 665

      Changing the split point to "30:2" :
      btree height: 3
      # of leaf nodes:660
      # of non-leaf nodes: 23
      index size: 1.18MB
      # of splits performed: 688

      I'm not at all convinced about the page counting, as it just doesn't make sense - with a 50:50 split, the leaf nodes ALL have a "first" value of 16, but for 20,000 records there should be 1250 leaf nodes with 16 key/value pairs each!  I count them by doing a dump on the tree and counting the number of times the dumpRecursive method is invoked with and without the _isLeaf flag set.  Similarly, there don't seem to be enough pages for the 31:1 split!  Oh well, something else I'm not understanding.

      I suspect (but haven't measured) a degradation in performance if keys are added randomly with an uneven split.  If I get time I'll do some more realistic tests with a page size of 64 and 1 million records, ordered and unordered, but unless the results are dramatic improvements (i.e., a factor of 2) I probably won't bother.  The mods to the BPage code are simple (and I can post them if anyone is interested), but thoroughly testing them (including the impact, if any, on the 'remove' code, which I haven't examined, and explaining the leaf node counts) and integrating the split point into the API as an attribute of the BTree is a bigger deal - although I suppose the default behaviour could remain the 50%:50% split.
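      Stripped of the BPage plumbing, the mod amounts to making the split point a parameter instead of the fixed half-way mark.  A sketch (array-based; the names here are mine, not JDBM's):

```java
import java.util.Arrays;

// Sketch of a configurable split: a full page of entries is divided so
// the left page keeps `leftSize` entries and the right page gets the
// rest.  leftSize = capacity/2 is the current 50:50 behaviour;
// leftSize = capacity-1 gives the "31:1" split (for a page size of 32)
// that favours monotonically increasing keys.
public class SplitDemo {
    static String[][] split(String[] fullPage, int leftSize) {
        String[] left  = Arrays.copyOfRange(fullPage, 0, leftSize);
        String[] right = Arrays.copyOfRange(fullPage, leftSize, fullPage.length);
        return new String[][] { left, right };
    }

    public static void main(String[] args) {
        String[] page = new String[32];
        for (int i = 0; i < page.length; i++) page[i] = "key" + i;
        String[][] even   = split(page, 16);  // the current default
        String[][] skewed = split(page, 31);  // skewed for ordered inserts
        System.out.println(even[0].length + ":" + even[1].length);
        System.out.println(skewed[0].length + ":" + skewed[1].length);
    }
}
```

      With ordered inserts the skewed left page never sees another key, so it stays nearly full instead of half empty - which is where the height and space savings above come from.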


      Kent Fitch
      AustLit Project

    • kent fitch

      kent fitch - 2002-05-31

      Hi Alex,

      I've found the problem with the suspect page counts reported
      in the previous post.  It was caused by code in the BPage
      "dumpRecursive()" method, which caused the right-most part of
      the tree not to be navigated, so its pages were never counted.
      I think this line in dumpRecursive() is bogus, because a null
      "key" is expected for the right-most pages:

          if ( _keys[ i ] == null ) break;

      Anyway, I've added custom stats gathering to BTree and BPage
      and rerun the tests.  I've also verified with a separate "Browse"
      program that all the keys are persisted and the index seems
      to work (although I have not tested removing keys).

      Here are the updated benchmark results:

      For all runs:
          KEY length: 22 characters
          VALUE length: 28 characters
          commit after every 100 inserts and after final insert


      Cache size for all runs: 20

      * Page Size:32, splitPoint:1, insertTime:32 sec DBSIZE 11.4MB
        Tree stats:  size:200000, height:4, splits:6665, nonLeafPages:217, leafPages:6452
        Average used entries per leaf page:30

      * Page Size:32, splitPoint:2, insertTime:32 sec DBSIZE 11.4MB
        Tree stats:  size:200000, height:4, splits:6895, nonLeafPages:232, leafPages:6667
        Average used entries per leaf page:29

      * Page Size:32, splitPoint:16 (ie, DEFAULT), insertTime:39 sec DBSIZE 12.2MB
        Tree stats:  size:200000, height:5, splits:13327, nonLeafPages:832, leafPages:12500
        Average used entries per leaf page:16

      ADDING approx 200,000 RANDOM KEYS (some duplicates will be generated and discarded)

      (Cache size of 20 was infeasible - increased to 300 for all runs)

      * Page Size:32, splitPoint:2, insertTime:723 sec DBSIZE 40.8MB
        Tree stats:  size:199997, height:5, splits:23157, nonLeafPages:2600, leafPages:20562
        Average used entries per leaf page:9

      * Page Size:32, splitPoint:16 (ie, DEFAULT), insertTime:500 sec DBSIZE 15.2MB
        Tree stats:  size:199993, height:4, splits:9379, nonLeafPages:437, leafPages:8946
        Average used entries per leaf page:22

      So the summary is:

      - For monotonically increasing keys, we can achieve a ~20%
      reduction in elapsed insertion time, a ~10% space saving and a
      height reduction, which should result in faster retrieval
      (but I haven't measured this).  The pro is a smaller number of
      pages (hence greater cache efficiency and a shallower search);
      the con is that the binary search for an entry within each
      page will take longer because each page is fuller.

      - Monotonically increasing keys greatly increase the number of
      non-leaf and leaf pages in the 50:50 split case.

      - For random keys, the 50:50 split works very well.  The very
      long insertion times are, I guess, caused by a cache too small
      to hold the entire index, which would greatly aid performance
      on random insertions.

      - For random keys, a skewed split is disastrous, so know
      your insertion patterns before moving away from a 50:50 split!

      I'll prepare the diffs for this change and email them
      directly to you for your consideration.


      Kent Fitch
      AustLit Project

