Logical containers and primary key indices.

  • Bryan Thompson


    I would like to share some preliminary results with you on an approach
    that I think of as "logical containers" that gets at many of the same
    advantages as btrees.  A logical container is comprised of a primary
    physical container that collects a linked list of zero or more
    secondary physical containers.  A logical container hides from the
    application the size limits of a physical container.  Each physical
    container groups zero or more persistent objects and knows how to
    resolve a global reference first to the physical container and then to
    the contained object.

    Logical containers act much like unordered btrees.  An insert goes
    into the primary physical container if there is space available, and
    otherwise into the tail of the linked list of secondary containers.
    If the tail is full at the time of an insert, a new container is
    appended to the tail of the linked list.  If an object is removed
    from a physical container, that container is rotated to the tail of
    the list, so the next insert will go into it.  If a physical
    container becomes full, it is rotated to the head of the list as
    part of the insert operation in which it becomes full.
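    The insert/rotation discipline above can be sketched with plain
    in-memory lists (this is only an illustration of the bookkeeping,
    not actual jdbm code; real physical containers would be persistent
    objects, and the class and constant names here are made up):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// In-memory sketch of the logical-container insert/remove rotation.
class LogicalContainer<T> {
    static final int MAX_PER_CONTAINER = 32;
    // Head of the deque stands in for the head of the container list;
    // inserts always target the tail.
    final Deque<List<T>> containers = new ArrayDeque<>();

    LogicalContainer() {
        containers.add(new ArrayList<>()); // primary physical container
    }

    void insert(T obj) {
        List<T> tail = containers.peekLast();
        if (tail.size() >= MAX_PER_CONTAINER) {
            tail = new ArrayList<>();         // append a fresh secondary container
            containers.addLast(tail);
        }
        tail.add(obj);
        // If this insert filled the container, rotate it to the head.
        if (tail.size() == MAX_PER_CONTAINER && containers.size() > 1) {
            containers.addFirst(containers.removeLast());
        }
    }

    void remove(T obj) {
        Iterator<List<T>> it = containers.iterator();
        while (it.hasNext()) {
            List<T> c = it.next();
            if (c.remove(obj)) {
                it.remove();              // detach the now-not-full container...
                containers.addLast(c);    // ...and rotate it to the tail
                return;
            }
        }
    }
}
```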

    Each object receives a long recid.  The lower 16 or 32 bits are used
    for the address space within a physical container, so the recman
    recid of the container is found by masking off those bits and using
    only the higher-order bits.  (I shift all recids generated by the
    recman 16 or 32 bits to the left so that the lower-order bits of a
    global object's recid are always zero.)
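    The shift/mask arithmetic can be sketched as follows, assuming a
    16-bit local address space (the class and method names are
    illustrative, not part of jdbm):

```java
// Sketch of the recid packing described above: the recman recid is
// shifted left and the low-order bits address an object within the
// physical container.
public class RecidCodec {
    static final int LOCAL_BITS = 16;
    static final long LOCAL_MASK = (1L << LOCAL_BITS) - 1;

    // Combine a container's recman recid with a local index.
    static long toGlobal(long recmanRecid, int localIndex) {
        return (recmanRecid << LOCAL_BITS) | (localIndex & LOCAL_MASK);
    }

    // Masking off the low bits recovers the container's recman recid.
    static long containerRecid(long globalRecid) {
        return globalRecid >>> LOCAL_BITS;
    }

    // The low bits identify the object within the container.
    static int localIndex(long globalRecid) {
        return (int) (globalRecid & LOCAL_MASK);
    }
}
```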

    Lookup may be somewhat faster since you can do this bit mask operation
    and then ask the recman directly for the correct container.  Once the
    container has been fetched, you ask the container to resolve the lower
    order bits, which it does by scanning to see which entry in that
    container has the desired recid.  This is pretty much what the BPage
    does.  There is a key[] (recids of the children) and a value[] (the
    children themselves).
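    A minimal sketch of this two-step lookup, with the recman stubbed by
    a HashMap and a BPage-style container holding parallel key[] and
    value[] arrays (all names here are illustrative, not jdbm API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of lookup: mask the recid to find the container, fetch it,
// then scan the container's entries for the desired global recid.
class ContainerLookup {
    static final int LOCAL_BITS = 16;

    static class Container {
        long[] key;     // global recids of the contained objects
        Object[] value; // the contained objects themselves
        Container(long[] key, Object[] value) { this.key = key; this.value = value; }
    }

    // Stand-in for the record manager: container recid -> container.
    final Map<Long, Container> recman = new HashMap<>();

    Object fetch(long globalRecid) {
        // 1. Mask off the low-order bits to get the container's recid.
        Container c = recman.get(globalRecid >>> LOCAL_BITS);
        if (c == null) return null;
        // 2. Scan the container for the entry with the desired recid.
        for (int i = 0; i < c.key.length; i++) {
            if (c.key[i] == globalRecid) return c.value[i];
        }
        return null;
    }
}
```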

    Here are some results with our application:

    Primary key index: pageSize=32.  Average object size not estimated.
    Average rate: 20/second. (26/second with 16 per container).

    Logical containers: maxPerContainer=32.  Average object size not
    estimated.  Average rate: 33/second.  (31/second with 16 per
    container.)

    Based on this, you can see that the "logical container" approach
    offers an alternative to the "primary key index" approach that you
    outlined in another thread.

    If you look at the profile data below, you can see the actual #of
    read(fetch), create(insert), update, and destroy(remove) operations,
    which will give you a sense of the distribution of operations during
    this test.

    read    time: 74100, 57.2% of total, # of entrances: 24121133
    create  time: 21178, 16.4% of total, # of entrances: 1124083
    update  time:  6349,  4.9% of total, # of entrances: 3160730
    destroy time:     0,  0.0% of total, # of entrances: 1

    Essentially, logical containers provide an unordered alternative to
    btrees.  However, what may be more telling is that each logical
    container pertains to a specific global application object and
    provides a means to encapsulate related application objects within the
    same persistent object (or within secondary containers when the
    primary container overflows).  This is like having one primary key
    index per global object, and local objects are allocated within that
    index.  An interesting test would be to do exactly that - have one
    BTree per global object.  The advantage of the logical container
    design is that the global object is the primary logical container, so
    you only fetch one object and get all the contained children in the
    same operation.  I don't see how you could do this with the current
    BTree implementation in jdbm.

    I'll see if I can refactor the logical container code into something
    that is general enough to submit as a contribution to the jdbm project
    so that other people can take a look at it as well.



    • Kevin Day

      This one is going to take me some time to digest - please bear with me while I ask questions and get my head around it...

      Can you please clarify your use of the term 'Global application object'?  Maybe we should use a concrete example to help things...  How about a banking example?  We would have business objects to represent Owners, Accounts and Transactions.

      In that scenario, how does a logical container represent one of these?

      Would each class have a dedicated "physical container"?  From my reading this would basically provide you with the equivalent of a namespace for your recids (Owners would have recids with certain high-order 16 bits, Accounts would have another, and Transactions would have yet another range of recids).  You've taken it a bit further in saying that you can have multiple high-order bit values for a single class (all grouped together in a linked list).

      Are you, then, using the maxPerContainer size to determine how many bits to use for the container vs internal offset?  For 32 objects per container, you'd have 56 bits (64 - 8) left for container id.

      If my understanding of things is correct, then this is really a proposal for an alternative way of obtaining a type of paging behavior that clusters based on object type alone.  I'm not sure yet what advantages this approach has over just using the record manager directly...  Unless there is some fancy clustering being done in the selection of recids (i.e. containers).

      I'm puzzled by the following statement - can you please expand a bit on it? :

      "The advantage of the logical container design is that the global object is the primary logical container, so you only fetch one object and get all the contained children in the same operation."

      What exactly do you mean by "contained children" ?  Can you give me an example given the sample application I described above?

      Finally, I'd like to do a quick clarification:  When I talk about primary key BTrees, I am absolutely using one BTree per business object class.  So, in the example, I would have 3 primary BTrees - one each for Owner, Account and Transaction.

      So, the recid in the BTree for each object is the full range of Long.  If I want to achieve clustering on the Transaction index to optimize disk read times, I could develop a hashing algorithm for determining new recids for the Transaction BTree that would use components of the Account recid as its first 16 bits (or whatever is appropriate), then use a counter for the remaining recid bits.
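      A hypothetical sketch of that recid-hashing scheme, taking the low
      16 bits of the Account recid as the high 16 bits of the new
      Transaction recid and filling the rest from a counter (all names
      and bit choices are made up for illustration):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative clustered-recid generator: Transactions for the same
// Account share a high-order prefix, so they cluster together.
class ClusteredRecids {
    static final int PREFIX_BITS = 16;
    private final AtomicLong counter = new AtomicLong();

    long nextTransactionRecid(long accountRecid) {
        // Low 16 bits of the account recid become the high 16 bits
        // of the transaction recid.
        long prefix = accountRecid & ((1L << PREFIX_BITS) - 1);
        long counterMask = (1L << (64 - PREFIX_BITS)) - 1;
        return (prefix << (64 - PREFIX_BITS))
                | (counter.incrementAndGet() & counterMask);
    }
}
```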

      This is the reason that I've been keen on implementing the MRQ at the BTree level - each primary index will have its own MRQ.

      If clustering wasn't desired, then the same behavior could be achieved by using the MRQ at the record manager level, and just storing objects directly into the manager (i.e. no primary index at all).

      What I'm describing above is actually an approach that may make sense... as long as I can figure out why inserts of smallish objects directly in the recman are so slow.

      We could use the logical row manager to implement clustering as part of a "pack and optimize" routine on jdbm - packing would be a matter of defragmenting the record file paging system.  Optimizing would be a matter of re-ordering records on disk so they are in the order of a specified index.

      The downside to that approach is that it would require a much better understanding of the logical row manager, and the optimization wouldn't "just happen"...

      I understand BTrees...  Trying to futz with the logical row manager may not be something that we really want to get into...  (Although it does seem straightforward).

      From an algorithm perspective, optimization (and packing) could look something like this:

      1. Make a copy of the logical row lookup table.
      2. Walk each index marked as primary.
      3. For each recid returned, write the row to a new jdbm database, and zero out the value in the lookup table copy.
      4. After all indexes have been walked, blast through the lookup table copy, and copy the bytes for each entry that is not '0' into the new database.
      5. Close the new jdbm database.
      6. Close the old jdbm database.
      7. Delete the old jdbm database.
      8. Rename the new jdbm database.

      The above could be done in a meta-transaction (handled by file naming, instead of with the log file), so the above would be done with transactions disabled.  If there was a failure, the new jdbm database would just be deleted.
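      The copy/zero-out portion of that procedure can be sketched in
      memory, with the row lookup table stood in for by a map from recid
      to row bytes (the close/delete/rename file steps are omitted, and
      all names are illustrative rather than jdbm internals):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of pack/optimize: rows reachable from a primary index are
// copied first, in index order, then any remaining rows follow.
class PackOptimize {
    static Map<Long, byte[]> pack(Map<Long, byte[]> oldDb,
                                  List<Long> primaryIndexOrder) {
        // Copy of the row lookup table (recid -> row bytes).
        Map<Long, byte[]> lookupCopy = new LinkedHashMap<>(oldDb);
        Map<Long, byte[]> newDb = new LinkedHashMap<>();
        // Walk the primary index; write each row to the new database in
        // index order and "zero out" (remove) its entry in the copy.
        for (long recid : primaryIndexOrder) {
            byte[] row = lookupCopy.remove(recid);
            if (row != null) newDb.put(recid, row);
        }
        // Copy any remaining (non-zeroed) rows into the new database.
        newDb.putAll(lookupCopy);
        return newDb;
    }
}
```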



      - K