On Behalf Of Kevin Day
Sent: Monday, April 10, 2006 9:36 PM
To: JDBM Developer listserv
Subject: re: [Jdbm-developer] record management
Interesting - this is similar to something that I saw in the (cricket?)
architecture for a threaded log-based db.
A couple of questions/comments that may spark some discussion points or new
ideas:
1. It seems that the proposed idea would work well until records are
updated in a manner that increases their size. At that point, it seems
unlikely to me that all of the records on a given page would remain on that
page, so you'd have some entries that would overflow into other pages.
BBT: Yes, the notional design is that each page has a slot map and a data
space. The slot map records the offset of the physical record on the page.
If there is an overflow situation, then the slot map operates as a
translation slot and records the _slot_ of the record on a _different_ page.
At most one indirection could be performed. If all slots on a page are
indirected, then this is the same as the current translation page solution.
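A minimal sketch of how such a slot map entry might be encoded (the bit widths and the SlotCodec name are my own illustration, not an actual jdbm page format):

```java
// Hypothetical encoding of a slot map entry in one long. A direct entry
// holds the record's byte offset on this page; an indirect entry holds the
// target pageId and slot, and is never followed by a second indirection.
public final class SlotCodec {
    private static final long INDIRECT_FLAG = 1L << 63;

    // Direct entry: the record's byte offset within this page.
    static long direct(int offset) {
        return offset;
    }

    // Indirect entry: target page id (48 bits) and slot number (15 bits).
    static long indirect(long pageId, int slot) {
        return INDIRECT_FLAG | (pageId << 15) | slot;
    }

    static boolean isIndirect(long entry) {
        return (entry & INDIRECT_FLAG) != 0;
    }

    static long pageIdOf(long entry) {
        return (entry & ~INDIRECT_FLAG) >>> 15;
    }

    static int slotOf(long entry) {
        return (int) (entry & 0x7FFF);
    }

    public static void main(String[] args) {
        long e = indirect(42L, 7);
        System.out.println(isIndirect(e) + " " + pageIdOf(e) + " " + slotOf(e));
        // prints: true 42 7
    }
}
```

A fetch inspects the slot: if it is direct the record is on this page; otherwise it follows exactly one hop to the target page's slot, matching the at-most-one-indirection rule above.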
2. I'm pretty sure that this would preclude the ability of ever packing or
shrinking the database... If I allocate tons of objects, then delete all
but one on each page, I can't shrink the database or otherwise optimize it
in any way. I have been envisioning a system that always places translation
pages at the front of the db file, even if it means moving data pages out of
the way during an allocation. In that way, the db can always be shrunk down
to the size of the number of translation pages that contain active slots.
BBT: This design could degenerate into the translation page design (per
above), but it is oriented towards reducing lookup overhead for objects
rather than towards compaction of the store.
3. This design does allow for some form of space optimization without
incurring the cost of updating the translation table itself - which
definitely has benefits.
4. It seems that this would reduce the efficiency of the translation
cache considerably - instead of having ~2000 slots on a page, we'd probably
wind up with ~20-50 (obviously, it depends on the size of the data records).
In the multi-version scenario, if a version table has to be written to disk
at all (and in the vast majority of cases it will not), this *might* force a
larger number of page writes to the safe than is strictly necessary. Given
the non-locality of recids, this may or may not actually be a consideration,
though.
BBT: I do not see how it could reduce the efficiency of the translation page
caching. I think that the better analogy is that translation pages are
overhead that we can avoid in many scenarios, but that the design can
degrade to the same performance as the translation page design. Also, I am
very much thinking in terms of clustering support with this design.
5. One way of addressing my concern with #2 (which is along the lines of
the cricket architecture) is to use a b-tree to manage the offsets of the
pages themselves. Each node in the BTree would represent the starting
offset of a contiguous block of pages - so if you have 50 pages on disk that
are ordered by their logical ordering, it would require a single BTree entry.
This allows pages to be moved to arbitrary locations (and, in fact, sets you
up for infinite stores - you just have occasional checkpoints that include
an entire BTree).
That, of course, re-introduces a lookup...
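A sketch of that idea, using a TreeMap in place of the BTree: one entry per contiguous run of pages, so 50 logically ordered pages cost a single entry. Class and method names here are illustrative, not jdbm APIs.

```java
import java.util.Map;
import java.util.TreeMap;

// One entry per contiguous run of pages: logicalStart -> (physicalStart,
// length). Moving a run elsewhere on disk just rewrites its entry.
public final class PageRunMap {
    private static final class Run {
        final long physicalStart;
        final long length;
        Run(long physicalStart, long length) {
            this.physicalStart = physicalStart;
            this.length = length;
        }
    }

    private final TreeMap<Long, Run> runs = new TreeMap<>();

    void addRun(long logicalStart, long physicalStart, long length) {
        runs.put(logicalStart, new Run(physicalStart, length));
    }

    // Translate a logical page number to its physical page number.
    long physical(long logical) {
        Map.Entry<Long, Run> e = runs.floorEntry(logical);
        if (e == null || logical >= e.getKey() + e.getValue().length)
            throw new IllegalArgumentException("unmapped page " + logical);
        return e.getValue().physicalStart + (logical - e.getKey());
    }

    public static void main(String[] args) {
        PageRunMap m = new PageRunMap();
        m.addRun(0, 1000, 50); // logical pages 0..49 at physical 1000..1049
        m.addRun(50, 200, 10); // logical pages 50..59 moved to 200..209
        System.out.println(m.physical(49) + " " + m.physical(55));
        // prints: 1049 205
    }
}
```

This is what makes arbitrary page relocation (and the "infinite store" checkpointing) possible, at the price of the extra lookup noted above.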
6. Locality takes a bit of a hit in terms of what is easy to implement and
what is not - if an application wants to request that the recman place a set of
objects near each other, it can only do so (efficiently) in the context of
the initial inserts of the objects (and only then if the allocator assigns
recids that are near each other).
BBT: Yes, this is an interesting issue. It seems that the more information
available to drive a clustering strategy the better. When an object is
first created the most information that you might expect to have on hand is
some notion of a "container". If you are able to defer until the object has
been linked, then you can leverage the links and metadata about link classes
to make clustering decisions. I see several ways of handling this: (1) use
the translation page mode of the design and defer the placement of the
physical records onto pages until more information is available. This would
support clustering of physical records, but not of the OIDs. The existing
jdbm defers serialization and writing onto the page image in an appropriate
manner but does not use a clustering strategy to guide the choice of pages
onto which to place the physical records. (2) use a policy in the
application framework that supports persistence by reachability such that
the OID assignment may be deferred. In this case, the OIDs could be
clustered as well. (3) use the notion of a "container" when assigning
OIDs. Objects inserted into a container would be clustered together.
BBT: This highlights a possible problem with the use of translation pages.
If the OIDs are not clustered, then a large store may demonstrate thrashing
in the translation pages since the OIDs themselves do not demonstrate
locality of reference. Clustering of persistent objects at insert time seems
to be required to minimize disk reads since otherwise reads of translation
pages may be scattered. Here are some sample numbers.
#of objects in store: 10M
average size of serialized object: 80 bytes.
per object overhead: 8 bytes (record header) vs 11 bytes (slot size)
adjusted average object size: 88 vs 91 bytes
#of translation pages (jdbm 1.x): 12,237
average #of objects per data page: 93 vs 90
#of data pages: 107,422 vs 111,084
Total #of pages: 119,659 vs 111,084
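As a sanity check on those numbers: assuming 8 KiB pages and ceiling division, the data page counts above fall out directly. The translation page count (12,237) is taken as given from the thread, since it depends on jdbm 1.x header details not spelled out here.

```java
// Back-of-envelope check of the sample numbers above, assuming 8 KiB pages.
public final class SpaceEstimate {
    static final long OBJECTS = 10_000_000L;
    static final long PAGE = 8192;

    // Ceiling of (OBJECTS * adjusted object size) / page size.
    static long dataPages(long adjustedObjectSize) {
        long bytes = OBJECTS * adjustedObjectSize;
        return (bytes + PAGE - 1) / PAGE;
    }

    public static void main(String[] args) {
        long jdbm1x = dataPages(80 + 8);    // 8-byte record header
        long slotted = dataPages(80 + 11);  // 11-byte slot overhead
        long translationPages = 12_237;     // taken as given from the thread
        System.out.println((jdbm1x + translationPages) + " vs " + slotted);
        // prints: 119659 vs 111084
    }
}
```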
Now I am not convinced that the translation page design is bad. It does
introduce an indirection, but if there is reasonable locality in the OIDs
then the translation pages will be cached. If we do not have locality in
the OIDs, then the new design will still have to do a lot of page reads for
the physical records since they will not be effectively clustered.
On these numbers the new design actually comes out marginally ahead in total
pages (111,084 vs 119,659), even though the translation design packs its
data pages slightly more densely. Either way, the gap is entirely a function
of the difference in the per-object overhead (3 bytes), so little
differences get magnified by the #of objects.
Of all of the above, the only one that I'm actually really concerned about
is #2 - the rest can be addressed with fancy bookkeeping (a pain in the neck
to develop, but it should be doable). I suspect that #5 is not going to be
palatable, because it kind of defeats the purpose of doing this in the first
place.
BBT: I admit that I am less concerned with compaction of the store and more
concerned with read performance and with the overhead associated with
managing free space in the store. This is still the largest hotspot in jdbm
1.x and takes 10% of the runtime for our application.
I think that the new design will provide faster management of free space on
pages. The basic principle is to divide each page into a slot map
(allocated like a stack - top down) and a data space (growing bottom up).
The data space would be dense. The free space would exist between the top
of the data space and the bottom of the slot map. Operations on the page
would be atomic (thread safe). Write operations could reorder the data
space and the slot map. Write operations could also force a physical record
to become indirected (translated to another page) or to become direct
(moved back onto the page).
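A minimal sketch of that free space rule, assuming an 8 KiB page and the 11-byte slot size from the numbers above (the class and field names are my own illustration):

```java
// Slots grow down from the top of the page, record data grows up from the
// bottom, and the free region is whatever lies between them.
public final class SlottedPage {
    static final int PAGE_SIZE = 8192;
    static final int SLOT_SIZE = 11;   // per-slot overhead from the thread
    private int dataTop = 0;           // first free byte above the data space
    private int slotCount = 0;

    int freeSpace() {
        return PAGE_SIZE - dataTop - slotCount * SLOT_SIZE;
    }

    // Returns the new slot number, or -1 if the record does not fit.
    int allocate(int recordLength) {
        if (recordLength + SLOT_SIZE > freeSpace()) return -1;
        dataTop += recordLength;       // data space stays dense
        return slotCount++;
    }

    public static void main(String[] args) {
        SlottedPage p = new SlottedPage();
        int s0 = p.allocate(80);
        int s1 = p.allocate(80);
        System.out.println(s0 + " " + s1 + " free=" + p.freeSpace());
        // prints: 0 1 free=8010
    }
}
```

Because both regions stay dense, the free-space check is a single subtraction, which is the point of the "faster management of free space" claim above.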
-----Original Message-----
I have been re-thinking record management some. I want to share some of my
thoughts here. I also want to see whether the version management
requirements for Kevin's MVCC module could be satisfied by such a design.
The main insight is that the existing translation page design could be
rethought as one possible extreme in a more flexible, and hopefully more
efficient, design. The basic change is that each page would incorporate the
capability to either store logical-to-physical translation slots or to
directly encode the physical record. The same slot map would be reused for
both purposes with appropriate coding to indicate an on-page record vs. a
record that was indirected to another page.
Setting aside the page header for the moment, the design would have records
growing up from the bottom of the page (lower address) and the slot map
growing down from the top of the page. The lower bits of the OID would
address slots in the page. Those slots would either give an on-page offset
or the pageId and slot of the record on another page. The data space and
the slot map space on the page would be kept dense, so there would never be
free space between records. This should be a consistent data structure, and
reconciliation of page images should be possible.
This design allows us to directly lookup records by the OID without
indirection. In the best case this means that we do one fetch for an object
vs two. If the record is on the page, then we are done. Otherwise we
indirect, and the effect is exactly as with the existing translation pages.
In the other extreme, if all records are indirected then the design is
essentially the same as the existing translation page design.
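A sketch of splitting an OID into a page number and slot index so that a record can be located without consulting a translation page. The 15-bit slot width is an assumption for illustration, not a decided parameter.

```java
// Hypothetical OID layout: high bits select the page, low bits select the
// slot within that page's slot map.
public final class Oid {
    static final int SLOT_BITS = 15;
    static final long SLOT_MASK = (1L << SLOT_BITS) - 1;

    static long oid(long page, int slot) {
        return (page << SLOT_BITS) | slot;
    }

    static long pageOf(long oid) {
        return oid >>> SLOT_BITS;
    }

    static int slotOf(long oid) {
        return (int) (oid & SLOT_MASK);
    }

    public static void main(String[] args) {
        long oid = oid(12345L, 9);
        System.out.println(pageOf(oid) + " " + slotOf(oid));
        // prints: 12345 9
    }
}
```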
I am still playing around with record management rules for the page, and
there are clearly many designs that are possible; clustering remains an
open issue.
Got to run.