I would like to share some preliminary results with you on an approach
that I think of as "logical containers" that gets at many of the same
advantages as btrees. A logical container is comprised of a primary
physical container that collects a linked list of zero or more
secondary physical containers. A logical container hides from the
application the size limits of a physical container. Each physical
container groups zero or more persistent objects and knows how to
resolve a global reference first to the physical container and then to
the contained object.
Logical containers act much like unordered btrees. Inserts into the
primary physical container if there is space available and otherwise
into the tail of the linked list of secondary containers. A new
container is appended to the tail of the linked list on an insert when
the tail is full. If an object is removed from a physical container
then it is rotated to the tail of the list, so the next insert will be
into that container. If a physical container becomes full then it is
rotated to the head of the list as part of the insert operation in
which it becomes full.
Each object receives a long recid. The lower 16 or 32 bits used for
the address space within a physical container, so the recman recid of
the container is found by masked those bits and using only the higher
order bits. (I shift all recids generated by the recman 16 or 32 bits
left so that the lower order bits of a global object are always zero.)
Lookup may be somewhat faster since you can do this bit mask operation
and then ask the recman directly for the correct container. Once the
container has been fetched, you ask the container to resolve the lower
order bits, which it does by scanning to see which entry in that
container has the desired recid. This is pretty much what the BPage
does. There is a key (recids of the children) and a value (the
Here are some results with our application:
Primary key index: pageSize=32. Average object size not estimated.
Average rate: 20/second. (26/second with 16 per container).
Logical containers: maxPerContainer=32. Average object size not
estimated. Average rate: 33/second. (31/second with 16 per
Based on this, you can see that the "logical container" approach
offers an alternative to the "primary key index" approach that you
outlined in another thread.
If you look at the profile data below, you can see the actual #of
read(fetch), create(insert), update, and destroy(remove) operations,
which will give you a sense of the distribution of operations during
read time: 74100, 57.2% of total, # of entrances: 24121133
create time: 21178, 16.4% of total, # of entrances: 1124083
update time: 6349, 4.9% of total, # of entrances: 3160730
destroy time: 0, 0.0% of total, # of entrances: 1
Essentially, logical containers provide an unordered alternative to
btrees. However, what may be more telling is that each logical
container pertains to a specific global application object and
provides a means to encapsulate related application objects within the
same persistent object (or within secondary containers when the
primary container overflows). This is like having one primary key
index per global object, and local objects are allocated within that
index. An interesting test would be to do exactly that - have one
BTree per global object. The advantage of the logical container
design is that the global object is the primary logical container, so
you only fetch one object and get all the contained children in the
same operation. I don't see how you could do this with the current
BTree implementation in jdbm.
I'll see if I can refactor the logical container code into something
that is general enough to submit as a contribution to the jdbm project
so that other people can take a look at it as well.
This one is going to take me some time to digest - please bear with me while I ask questions and get my head around it...
Can you please clarify your use of the term 'Global application object'? Maybe we should use a concrete example to help things... How about a banking example? We would have business objects to represent Owners, Accounts and Transactions.
In that scenario, how does a logical container represent one of these?
Would each class have a dedicated "physical container"? From my reading this would basically provide you with the equivalent of a namespace for your recids (Owners would have recids with certain high-order 16 bits, Accounts would have another, and Transactions would have yet another range of recids). You've taken it a bit further in saying that you can have mutliple high-order bit values for a single class (all grouped together in a linked list).
Are you, then, using the maxPerContainer size to determine how many bits to use for the container vs internal offset? For 32 objects per container, you'd have 56 bits (64 - 8) left for container id.
If my understanding of things is correct, then is is really a proposal for an alternative way of obtaining a type of paging behavior that clusters based on object type alone. I'm not sure, yet, what advantages this approach has over just using the record manager directly... Unless there is some fancy clustering that is being done in the selection of recids (i.e. containers).
I'm puzzled by the following statement - can you please expand a bit on it? :
"The advantage of the logical container design is that the global object is the primary logical container, so you only fetch one object and get all the contained children in the same operation."
What exactly do you mean by "contained children" ? Can you give me an example given the sample application I described above?
Finally, I'd like to do a quick clarification: When I talk about primary key BTrees, I am absolutely using one BTree per business object class. So, in the example, I would have 3 primary BTrees - one each for Owner, Account and Transacation.
So, the recid in the BTree for each object is the full range of Long. If I want to achieve clustering on the Transaction index to optimize disk read times, I could develop a hashing algorithm for determining new recids for the Transaction BTree that would use components of the Account recid as it's first 16 bits (or whatever is appropriate), then uses a counter for the remaining recid bits.
This is the reason that I've been keen on implementing the MRQ at the BTree level - each primary index will have it's own MRQ.
If clustering wasn't desired, then the same behavior could be achieved by using the MRQ at the record manager level, and just storing objects directly into the manager (i.e. no primary index at all).
What I'm describing above is actually an approach that may make sense... as long as I can figure out why inserts of smallish objects directly in the recman are so slow.
We could use the logical row manager to implement clustering as part of a "pack and optimize" routine on jdbm - packing would be a matter of defragmenting the recordifle paging system. Optimizing would be a matter of re-ordering records on disk so they are in the order of a specified index.
The downside to that approach is that it would require a much better understanding of the logcal row manager, and the optimization wouldn't "just happen"...
I understand BTrees... Trying to futz with the logical row manager may not be something that we really want to get into... (Although it does seem straightforward).
From an algorithm perspective, optimization (and packing) could look something like this:
Make a copy of the logical row lookup table
Walk each index marked as primary.
For each recid returned, write the row to a new jdbm database, and zero out the value in the lookup table copy.
After all indexes have been walked, blast through the row lookup table, and copy bytes for each entry that is not '0' into the new table.
Close new jdbm database
Close old jdbm database
Delete old jdbm database
Rename new jdbm database
The above could be done in a meta-transaction (handled by file naming, instead of with the log file), so the above would be done with transactions disabled. If there was a failure, the new jdbm database would just be deleted.