From: Thompson, B. B. <BRY...@sa...> - 2006-04-11 10:33:38
|
Kevin,

Comments below.

-bryan

________________________________________
From: jdb...@li... [mailto:jdb...@li...] On Behalf Of Kevin Day
Sent: Monday, April 10, 2006 9:36 PM
To: JDBM Developer listserv
Subject: re: [Jdbm-developer] record managment

Bryan-

Interesting - this is similar to something that I saw in the (cricket?) architecture for a threaded log-based db.

A couple of questions/comments that may spark some discussion points or new directions:

1. It seems that the proposed idea would work well until records are updated in a manner that increases their size. At that point, it seems unlikely to me that all of the records on a given page would remain on that page, so you'd have some entries that would overflow into other pages.

BBT: Yes, the notional design is that each page has a slot map and a data space. The slot map records the offset of the physical record on the page. If there is an overflow situation, then the slot map operates as a translation slot and records the _slot_ of the record on a _different_ page. At most one indirection could be performed. If all slots on a page are indirected, then this is the same as the current translation page solution.

2. I'm pretty sure that this would preclude the ability of ever packing and shrinking the database... If I allocate tons of objects, then delete all but one on each page, I can't shrink the database or otherwise optimize it in any way. I have been envisioning a system that always places translation pages at the front of the db file, even if it means moving data pages out of the way during an allocation. In that way, the db can always be shrunk down to the size of the number of translation pages that contain active slots.

BBT: This design could degenerate into the translation page design (per above), but it is oriented towards reducing the lookup cost for objects rather than supporting compaction of the store.
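The one-hop indirection Bryan describes can be sketched as a tagged slot entry: each slot in a page's slot map holds either the on-page byte offset of the record, or a (pageId, slot#) pair naming where the record actually lives, and lookup follows that redirection at most once. The flag bit and field widths below are illustrative assumptions, not jdbm's actual on-disk format.

```java
// Sketch only: one possible encoding of a slot-map entry.
// High bit set  -> indirected slot: (pageId, slot#) on another page.
// High bit clear -> direct slot: on-page byte offset of the record.
public class SlotEntry {
    private static final long INDIRECT = 1L << 63;

    static long direct(int offset) {
        return offset;                       // flag bit clear
    }

    static long indirect(long pageId, int slot) {
        // Pack pageId above a (hypothetical) 16-bit slot number.
        return INDIRECT | (pageId << 16) | (slot & 0xFFFF);
    }

    static boolean isIndirect(long entry) {
        return (entry & INDIRECT) != 0;
    }

    static int offset(long entry)  { return (int) entry; }
    static long pageId(long entry) { return (entry & ~INDIRECT) >>> 16; }
    static int slot(long entry)    { return (int) (entry & 0xFFFF); }
}
```

A fetch would decode the slot addressed by the low bits of the OID; if `isIndirect` is false the record is read from the same page, otherwise one more page read resolves it — which is why the worst case degenerates to exactly the translation-page behavior.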
3. This design does allow for some form of space optimization without incurring the cost of updating the translation table itself - which definitely has benefits.

4. It seems that this would reduce the efficiency of the translation paging cache considerably - instead of having ~2000 slots on a page, we'd probably wind up with ~20-50 (obviously, it depends on the size of the data record). In the multi-version scenario, if a version table has to be written to disk at all (and in the vast majority of cases it will not), this *might* force a larger number of page writes to the safe than is strictly necessary. Given the non-locality of recids, this may or may not actually be a consideration, though.

BBT: I do not see how it could reduce the efficiency of the translation page caching. I think that the better analogy is that translation pages are an overhead that we can avoid in many scenarios, but that the design can degrade to the same performance as the translation page design. Also, I am very much thinking in terms of clustering support with this design.

5. One way of addressing my concern with #2 (which is along the lines of the cricket architecture) is to use a b-tree to manage the offsets of the pages themselves. Each node in the BTree would represent the starting point of a contiguous block of pages - so if you have 50 pages on disk that are ordered by their logical ordering, it would require a single BTree node. This allows pages to be moved to arbitrary locations (and, in fact, sets you up for infinite stores - you just have occasional checkpoints that include an entire BTree).

That, of course, re-introduces a lookup...
6. Locality takes a bit of a hit in terms of what is easy to implement vs not - if an application wants to request that the recman place a set of objects near each other, it can only do so (efficiently) in the context of the initial inserts of the objects (and only then if the allocator assigns recids that are near each other).

BBT: Yes, this is an interesting issue. It seems that the more information available to drive a clustering strategy, the better. When an object is first created, the most information that you might expect to have on hand is some notion of a "container". If you are able to defer until the object has been linked, then you can leverage the links and metadata about link classes to make clustering decisions. I see several ways of handling this: (1) use the translation page mode of the design and defer the placement of the physical records onto pages until more information is available. This would support clustering of physical records, but not of the OIDs. The existing jdbm defers serialization and writing onto the page image in an appropriate manner, but does not use a clustering strategy to guide the choice of pages onto which to place the physical records. (2) use a policy in the application framework that supports persistence by reachability such that the OID assignment may be deferred. In this case, the OIDs could be clustered as well. (3) use the notion of a "container" for clustering OIDs. Objects inserted into a container will be clustered together.

BBT: This highlights a possible problem with the use of translation pages. If the OIDs are not clustered, then a large store may thrash in the translation pages, since the OIDs themselves do not demonstrate locality of reference. Clustering of persistent objects at insert appears to be required to minimize disk reads, since otherwise reads of translation pages may be scattered. Here are some sample numbers.
#of objects in store: 10M
average size of serialized object: 80 bytes
per-object overhead: 8 bytes (record header) vs 11 bytes (slot size)
adjusted average object size: 88 vs 91 bytes
#of translation pages (jdbm 1.x): 12,237
average #of objects per data page: 93 vs 90
#of data pages: 107,422 vs 111,084
Total #of pages: 119,659 vs 111,084

Now I am not convinced that the translation page design is bad. It does introduce an indirection, but if there is reasonable locality in the OIDs then the translation pages will be cached. If we do not have locality in the OIDs, then the new design will still have to do a lot of page reads for the physical records, since they will not be effectively clustered. The translation table design appears to be marginally more space efficient, but this is entirely a function of the difference in the per-object overhead (3 bytes), so little differences get magnified by the #of objects.

Kevin wrote:

Of all of the above, the only one that I'm actually really concerned about is #2 - the rest can be addressed with fancy bookkeeping (a pain in the neck to develop, but it should be doable). I suspect that #5 is not going to be palatable, because it kind of defeats the purpose of doing this in the first place.

BBT: I admit that I am less concerned with compaction of the store and more concerned with read performance and with the overhead associated with managing free space in the store. This is still the largest hotspot in jdbm 1.x and takes 10% of the runtime for our application.

I think that the new design will provide faster management of free space on pages. The basic principle is to divide each page into a slot map (allocated like a stack - top down) and a data space (growing bottom up). The data space would be dense. The free space would exist between the end of the data space and the bottom of the slot map. Operations on the page would be atomic (thread safe).
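As a side note, the data-page figures above are consistent with 8 KiB pages: ceil(10M × 88 / 8192) = 107,422 and ceil(10M × 91 / 8192) = 111,084, with the 12,237 translation pages taken as given. A quick check (the class and constants here are mine, for illustration only):

```java
// Sanity-check of the sample numbers, assuming 8 KiB pages.
public class PageMath {
    static final long PAGE = 8192;              // assumed page size in bytes
    static final long OBJECTS = 10_000_000L;    // 10M objects in the store

    // Pages needed for a dense packing at the given per-object size.
    static long dataPages(long perObjectBytes) {
        long totalBytes = OBJECTS * perObjectBytes;
        return (totalBytes + PAGE - 1) / PAGE;  // ceiling division
    }
}
```

With 88 bytes/object (translation design) this gives 107,422 data pages, plus the 12,237 translation pages for a total of 119,659; with 91 bytes/object (slot design) it gives 111,084 pages total, matching the table.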
Write operations could reorder the data space and the slot map. Write operations could also force a physical record to become indirected (translated to another page) or to become direct (on-page).

-bryan

- K

>
All,

I have been re-thinking record management some. I want to share some of my thoughts here. I also want to see whether the version management requirements for Kevin's MVCC module could be satisfied by such a design.

The main insight is that the existing translation page design could be rethought as one possible extreme in a more flexible, and hopefully more efficient, design. The basic change is that each page would incorporate the capability to either store logical-to-physical translation slots or directly encode the physical record. The same slot map would be reused for both purposes, with appropriate coding to indicate an on-page record vs. a record that was indirected to another page.

Setting aside the page header for the minute, the design would have records growing up from the bottom of the page (lower address) and the slot map growing down from the top of the page. The lower bits of the OID would address slots in the page. Those slots would either give an on-page offset or the pageId and slot of the record on another page. The data space and the slot map space on the page would be kept dense, so there would never be free space between records. This should be a consistent data structure, and reconciliation of page images should be possible.

This design allows us to directly look up records by the OID without indirection. In the best case this means that we do one fetch for an object vs two. If the record is on-page, then we are done. Otherwise we indirect, and the effect is exactly as with the existing translation pages.
In the extreme, if all records are indirected, then the design is essentially the same as the existing translation page design.

I am still playing around with record management rules for the page, and there are clearly many designs that are possible; clustering remains an important issue.

Got to run.

-bryan
<
|
From: Thompson, B. B. <BRY...@sa...> - 2006-04-11 12:14:12
|
I am wondering if we could define the record management API in terms of an OID interface and then provide for delayed conversion of OIDs to 64-bit integers. After insert but before conversion the OID would "float". The conversion would happen when the object was evicted from cache or the transaction commits. At this point the OID would be assigned and locked for the object. I think that this would make it possible to cluster OIDs based on the "mature" state of the object.

Internally, the OID object would maintain a long integer. We need to differentiate an OID which is NULL (0L) from one which is floating (-1L?). All caching operations for records would be defined in terms of OIDs.

I would suggest a factory for OIDs. Different stores might choose different page sizes, which would mean a different #of bits reserved for the slot# on the page vs the pageId. That context would have to come from a factory associated with the store.

-bryan

-----Original Message-----
From: jdb...@li... [mailto:jdb...@li...] On Behalf Of Thompson, Bryan B.
Sent: Tuesday, April 11, 2006 6:32 AM
To: Kevin Day; JDBM Developer listserv
Subject: RE: [Jdbm-developer] record managment

Kevin,

Comments below.

-bryan
<
|
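Bryan's floating-OID proposal above might look like the sketch below. The class, the `slotBits` split, and the method names are all invented for illustration; the real context (page size, hence bit widths) would come from the store's OID factory, as he suggests.

```java
// Sketch of a "floating" OID: unassigned until cache eviction or commit,
// then locked to a physical (pageId, slot#) packed into a long.
public class Oid {
    static final long NULL_OID = 0L;   // distinguished NULL, per the proposal
    static final long FLOATING = -1L;  // not yet assigned

    private long value = FLOATING;
    private final int slotBits;        // #of bits for the slot#, store-specific

    Oid(int slotBits) { this.slotBits = slotBits; }

    boolean isFloating() { return value == FLOATING; }

    // Called on eviction or commit: fixes the OID once and for all.
    void assign(long pageId, int slot) {
        if (!isFloating()) throw new IllegalStateException("already assigned");
        value = (pageId << slotBits) | slot;
    }

    long pageId() { return value >>> slotBits; }
    int slot()    { return (int) (value & ((1L << slotBits) - 1)); }
}
```

Deferring `assign` until the object's "mature" state is known is what would let the store pick a pageId that clusters the OID with its neighbors.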
From: Kevin D. <ke...@tr...> - 2006-04-11 15:56:16
|
Bryan-

Given that the insert() operation returns a recid that is then used extensively by other objects in the store, how do you see this working? Are you talking about not having a long recid that is returned for object references, but instead having some sort of 'reference' object? I do this in my applications, but you have to handle serialization specially for the reference objects...

- K

> I am wondering if we could define the record management API in terms of an OID interface and then provide for delayed conversion of OIDs to 64-bit integers. After insert but before conversion the OID would "float". <
|
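One hypothetical way to handle the special serialization Kevin mentions is a reference object that serializes down to the bare long recid, so stored objects never embed the wrapper itself. This is an illustrative sketch, not jdbm API; a floating OID would have to be forced to a final recid before it is written.

```java
import java.io.*;

// Sketch: a 'reference' object standing in for a raw long recid.
public class RecordRef implements Externalizable {
    private long recid;

    public RecordRef() {}                       // required by Externalizable
    public RecordRef(long recid) { this.recid = recid; }
    public long recid() { return recid; }

    @Override public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(recid);                   // only the recid hits the page
    }

    @Override public void readExternal(ObjectInput in) throws IOException {
        recid = in.readLong();
    }

    // Helper demonstrating that a reference survives a serialization cycle.
    static long roundTrip(long recid) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(new RecordRef(recid));
            oos.flush();
            ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return ((RecordRef) ois.readObject()).recid();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```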
From: Kevin D. <ke...@tr...> - 2006-04-11 21:44:21
|
Bryan-

> I certainly agree that writing to a clean page is a dead simple strategy when you are doing a bunch of inserts and makes it especially easy to cluster objects since there is only one tx doing the allocation and with data on the page. <

KD - What is unknown is whether this strategy is any worse than any other strategy. It's so simple that it seems like it almost has to be worse - but I can't really put my finger on exactly how it would be worse. I'm wondering if there is any research in the literature that critiques this approach.... ?

One improvement to this strategy would be to have a bit of interaction with the page cache. We evict pages in chunks, and perform a GC run across all the pages in a given chunk. If GC actually results in reclaimed space, the changed pages get pushed back into the safe at the top of the cache.

Tuning of the cache then becomes (in part) choosing a size such that page evictions result in entire pages being removed from the write set.

Like I said, this may be a complete loser approach - but all of that extra GC, etc... is occurring in a background process, so maybe not... My biggest issue is that it requires multiple writes of the pages to the safe... And when we add translation pages to the mix, a single optimization of a given data page could result in a large number of pages having to be re-saved to cache.

An optimization would be to do what you are proposing and have an allocator strategy that only tracks open record ranges in cached pages (with slots on dirty pages being given higher priority).

Certainly, having the translation info and data on the same page would fix some of this. How about the following as a possible compromise that may give the best of both worlds:

Maintain a PAGE translation system at the beginning of the db file, but keep record location inside the page itself with the data. A lookup would still consist of a translation lookup to determine the physical page, but the position of a given record on its page could change dynamically. This would allow us to quickly move pages around without having to adjust potentially scattered OIDs, and allow us to quickly coalesce freed records to maximize available space on a page.

I'm still noodling on this idea - I think there is something here - I'm just not sure about the details.

- K

>
Kevin,

I was thinking that the pages in the cache would be those available for the record allocator. You simply notice how much free space is on each page in the page cache as it comes in, and build up an access structure to find a page with sufficient free space for an allocation quickly.

I certainly agree that writing to a clean page is a dead simple strategy when you are doing a bunch of inserts and makes it especially easy to cluster objects since there is only one tx doing the allocation and with data on the page.

-bryan

From: jdb...@li... [mailto:jdb...@li...] On Behalf Of Kevin Day
Sent: Tuesday, April 11, 2006 3:46 PM
To: JDBM Developer listserv
Subject: re[4]: [Jdbm-developer] record managment

Bryan-

Ahh - I see the concern now.

I don't think this really resolves the allocator issue - you will still wind up having to keep track of which pages have how much space left on them. Imagine how the system will look after running for a long period of time - basically, you will have a bunch of pages, all with different sized dead zones in the middle. You don't want to have to scan each page looking for sufficient available space, so we are back into the allocator discussion.

Some other options to consider:

1. Using a Fibonacci series index for starting positions of linked lists containing free records

or

2. Always write to clean pages, and use a low-priority optimization operation to move records from partially used pages to the wilderness zone - this has the advantage of completely removing the allocator from the discussion. Disadvantages would be that the db file could grow quite a bit during heavy load, then shrink back during the optimization. I think that this would be the best in terms of performance - a given transaction's update set would be basically written to disk in an entirely linear fashion, and it maximizes cache utilization.

Tracking free slots is easy - we just add a pointer to each page with the first available slot on the page, and another to the next page that contains available slots. Slots would be assigned until the page was full, then it would move on to the next page.

I certainly see the need for a mechanism for allowing the application to force the db to group a set of records - I just think that this can be done in the context of arbitrary re-locations of records, using the translation system....

- K
<
|
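The free-slot bookkeeping Kevin describes - each page carrying the first available slot on it plus a link to the next page with available slots - is essentially an intrusive free list over pages. A minimal sketch, with all names and the one-slot-per-page simplification invented here:

```java
import java.util.*;

// Sketch: pages chained into a free list via their headers.
public class FreeSlotList {
    static final int NONE = -1;

    static class PageHeader {
        int firstFreeSlot = NONE;  // first available slot on this page
        int nextFreePage = NONE;   // next page that has available slots
    }

    final List<PageHeader> pages = new ArrayList<>();
    int firstFreePage = NONE;      // head of the free-page chain

    // Register a page that has a free slot; it is pushed onto the chain.
    int addPage(int freeSlot) {
        PageHeader h = new PageHeader();
        h.firstFreeSlot = freeSlot;
        h.nextFreePage = firstFreePage;
        pages.add(h);
        firstFreePage = pages.size() - 1;
        return firstFreePage;
    }

    // Allocate from the head page; unlink it once it has no free slots.
    // Returns {pageId, slot}, or null if no page has space.
    int[] allocate() {
        if (firstFreePage == NONE) return null;
        PageHeader h = pages.get(firstFreePage);
        int[] loc = { firstFreePage, h.firstFreeSlot };
        h.firstFreeSlot = NONE;            // sketch tracks one slot per page
        firstFreePage = h.nextFreePage;    // move on to the next page
        return loc;
    }
}
```

Allocation never scans a page for space - it follows two pointers - which is the property Kevin is after.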
From: Kevin D. <ke...@tr...> - 2006-04-11 18:02:38
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML><HEAD> <STYLE type=text/css> P, UL, OL, DL, DIR, MENU, PRE { margin: 0 auto;}</STYLE> <META content="MSHTML 6.00.2900.2802" name=GENERATOR></HEAD> <BODY leftMargin=1 topMargin=1 rightMargin=1><FONT face=Tahoma> <DIV><FONT face=Arial size=2>Bryan-</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>The numbers you provide are definitely instructive... If you attach a profiler to your application, what percentage of the processing time is being spent in lookup? I suspect that inserts are where the majority of the performance hit is coming from, so I don't want to get too far into talking about ways of tweaking the last bit of performance out of reads, etc...</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>My general opinion here is that if we want OIDs to have locality, that we do it by providing batch insert operations on the record manager (i.e. the application could call insert(Object[]) and get back a list of contiguous OIDs - the translation page allocator would then do whatever it had to to get contiguous blocks. I'm not sure that a complete re-think of the dedicated translation page thread is necessary to achive this goal. Once you have OIDs all on a page, then you can have the record allocator move records around as the application sees fit to obtain locality. 
Reading of all of the objects with 'locality' to each other would involve two page reads.

I think that doing a lot of work to reduce two page reads to one may not really gain us much in terms of overall system improvement.

The good news is that the proposed MVCC architecture should be completely independent of the paging system, so we will be free to play with different choices if we see fit.

I do want to explore some of the potential hotspots in the proof of concept and see if you have any thoughts on object-to-page cache interaction - I'll write that in a separate email.

- K

> Kevin,

Comments below.

-bryan

________________________________________
From: jdb...@li... [mailto:jdb...@li...] On Behalf Of Kevin Day
Sent: Monday, April 10, 2006 9:36 PM
To: JDBM Developer listserv
Subject: re: [Jdbm-developer] record managment

Bryan-

Interesting - this is similar to something that I saw in the (cricket?) architecture for a threaded log based db.

A couple of questions/comments that may spark some discussion points or new directions:

1.
It seems that the proposed idea would work well until records are updated in a manner that increases their size. At that point, it seems unlikely to me that all of the records on a given page would remain on that page, so you'd have some entries that would overflow into other pages.

BBT: Yes, the notional design is that each page has a slot map and a data space. The slot map records the offset of the physical record on the page. If there is an overflow situation, then the slot map operates as a translation slot and records the _slot_ of the record on a _different_ page. At most one indirection could be performed. If all slots on a page are indirected, then this is the same as the current translation page solution.

2. I'm pretty sure that this would preclude the ability of ever packing and shrinking the database... If I allocate tons of objects, then delete all but one on each page, I can't shrink the database or otherwise optimize it in any way. I have been envisioning a system that always places translation pages at the front of the db file, even if it means moving data pages out of the way during an allocation. In that way, the db can always be shrunk down to the size of the number of translation pages that contain active slots.

BBT: This design could degenerate into the translation page design (per above), but it is oriented towards reducing lookups for objects rather than towards compaction of the store.

3. This design does allow for some form of space optimization without incurring the cost of updating the translation table itself - which definitely has benefits.

4. It seems that this would reduce the efficiency of the translation page cache considerably - instead of having ~2000 slots on a page, we'd probably wind up with ~20-50 (obviously, it depends on the size of the data record).
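Bryan's slot-map scheme from point 1 - a slot that either holds an on-page offset or one level of indirection to a slot on another page - can be sketched as follows (names and the encoding are hypothetical; a real implementation would pack this into the on-page slot bytes):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the slot map: each slot is either DIRECT (record bytes are on
 *  this page at `offset`) or INDIRECT (record lives at targetPage/targetSlot).
 *  Per the design, at most one indirection is ever followed. */
class SlotMapSketch {
    static final int DIRECT = 0, INDIRECT = 1;

    static class Slot {
        int kind;         // DIRECT or INDIRECT
        int offset;       // valid when DIRECT: byte offset of record on this page
        long targetPage;  // valid when INDIRECT
        int targetSlot;
    }

    final Map<Long, Slot[]> pages = new HashMap<>();

    /** Resolve an OID's (pageId, slot) to the {pageId, offset} holding the bytes. */
    long[] resolve(long pageId, int slot) {
        Slot s = pages.get(pageId)[slot];
        if (s.kind == DIRECT) return new long[]{pageId, s.offset};   // best case: one page read
        Slot t = pages.get(s.targetPage)[s.targetSlot];
        assert t.kind == DIRECT : "at most one indirection allowed";
        return new long[]{s.targetPage, t.offset};                   // overflow case: two reads
    }
}
```

In the extreme where every slot is INDIRECT, the page behaves exactly like one of today's translation pages, which is the degenerate case Bryan mentions above.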
In the multi-version scenario, if a version table has to be written to disk at all (and in the vast majority of cases it will not), this *might* force a larger number of page writes to the safe than is strictly necessary. Given the non-locality of recids, this may or may not actually be a consideration, though.

BBT: I do not see how it could reduce the efficiency of the translation page caching. I think that the better analogy is that translation pages are an overhead that we can avoid in many scenarios, but that the design can degrade to the same performance as the translation page design. Also, I am very much thinking in terms of clustering support with this design.

5. One way of addressing my concern with #2 (which is along the lines of the cricket architecture) is to use a B-tree to manage the offsets of the pages themselves. Each node in the B-tree would represent the starting point of a contiguous block of pages - so if you have 50 pages on disk that are ordered by their logical ordering, it would require a single B-tree node. This allows pages to be moved to arbitrary locations (and, in fact, sets you up for infinite stores - you just have occasional checkpoints that include an entire B-tree).

That, of course, re-introduces a lookup...

6. Locality takes a bit of a hit in terms of what is easy to implement vs. not - if an application wants to request that the recman place a set of objects near each other, it can only do so (efficiently) in the context of the initial inserts of the objects (and only then if the allocator assigns recids that are near each other).

BBT: Yes, this is an interesting issue. It seems that the more information available to drive a clustering strategy the better. When an object is first created, the most information that you might expect to have on hand is some notion of a "container".
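Point 5's B-tree of contiguous page runs can be illustrated with a sorted map keyed by the first logical page of each run (a TreeMap stands in for the B-tree here; all names are illustrative). Fifty in-order pages cost a single entry, and moving pages simply splits runs.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch of point 5: logical-to-physical page translation stored as
 *  contiguous runs. Each entry maps the first logical page of a run to the
 *  physical page where that run starts. */
class PageRunIndex {
    // logical start page -> physical start page of a contiguous run
    final TreeMap<Long, Long> runs = new TreeMap<>();

    void addRun(long logicalStart, long physicalStart) {
        runs.put(logicalStart, physicalStart);
    }

    /** Translate a logical page number to its physical page number by
     *  finding the run containing it and adding the offset within the run. */
    long physical(long logicalPage) {
        Map.Entry<Long, Long> run = runs.floorEntry(logicalPage);
        return run.getValue() + (logicalPage - run.getKey());
    }
}
```

This is the lookup Kevin notes is re-introduced: every page access pays a (cached, in-memory) run search.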
If you are able to defer until the object has been linked, then you can leverage the links and metadata about link classes to make clustering decisions. I see several ways of handling this: (1) use the translation page mode of the design and defer the placement of the physical records onto pages until more information is available. This would support clustering of physical records, but not of the OIDs. The existing jdbm defers serialization and writing onto the page image in an appropriate manner but does not use a clustering strategy to guide the choice of pages onto which to place the physical records. (2) use a policy in the application framework that supports persistence by reachability, such that the OID assignment may be deferred. In this case, the OIDs could be clustered as well. (3) use the notion of a "container" for clustering OIDs. Objects inserted into a container would be clustered together.

BBT: This highlights a possible problem with the use of translation pages. If the OIDs are not clustered, then a large store may demonstrate thrashing in the translation pages, since the OIDs themselves do not demonstrate locality of reference. Clustering of persistent objects at insert appears to be required to minimize disk reads, since otherwise reads of translation pages may be scattered. Here are some sample numbers.

#of objects in store: 10M
average size of serialized object: 80 bytes
per object overhead: 8 bytes (record header) vs 11 bytes (slot size)
adjusted average object size: 88 vs 91 bytes
#of translation pages (jdbm 1.x): 12,237
average #of objects per data page: 93 vs 90
#of data pages: 107,422 vs 111,084
total #of pages: 119,659 vs 111,084

Now I am not convinced that the translation page design is bad. It does introduce an indirection, but if there is reasonable locality in the OIDs then the translation pages will be cached.
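Bryan's data-page figures are reproduced exactly by assuming an 8192-byte page: 8192/88 = 93 and 8192/91 = 90 objects per page, and ceil(10M x 88 / 8192) = 107,422 vs ceil(10M x 91 / 8192) = 111,084 data pages, with 12,237 + 107,422 = 119,659 total. The page size is my assumption, but it is the only one consistent with his numbers. A quick sketch of the arithmetic:

```java
/** Arithmetic behind the sample numbers, assuming an 8 KB page size
 *  (an assumption, but it reproduces the quoted data-page counts exactly). */
class StoreSizing {
    static final long OBJECTS = 10_000_000L;
    static final int PAGE = 8192;

    /** Objects that fit on one data page for a given adjusted object size. */
    static int objectsPerPage(int adjustedObjectSize) {
        return PAGE / adjustedObjectSize;
    }

    /** Data pages needed for the whole store (ceiling division). */
    static long dataPages(int adjustedObjectSize) {
        long bytes = OBJECTS * adjustedObjectSize;
        return (bytes + PAGE - 1) / PAGE;
    }
}
```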
If we do not have locality in the OIDs, then the new design will still have to do a lot of page reads for the physical records, since they will not be effectively clustered.

The translation table design appears to be marginally more space efficient, but this is entirely a function of the difference in the per-object overhead (3 bytes), so little differences get magnified by the #of objects.

Kevin wrote:

Of all of the above, the only one that I'm actually really concerned about is #2 - the rest can be addressed with fancy bookkeeping (a pain in the neck to develop, but it should be doable). I suspect that #5 is not going to be palatable, because it kind of defeats the purpose of doing this in the first place.

BBT: I admit that I am less concerned with compaction of the store and more concerned with read performance and with the overhead associated with managing free space in the store. This is still the largest hotspot in jdbm 1.x and takes 10% of the runtime for our application.

I think that the new design will provide faster management of free space on pages. The basic principle is to divide each page into a slot map (allocated like a stack - top down) and a data space (growing bottom up). The data space would be dense. The free space would exist between the end of the data space and the bottom of the slot map. Operations on the page would be atomic (thread safe). Write operations could reorder the data space and the slot map. Write operations could also force a physical record to become indirected (translated to another page) or to become direct (on page).

-bryan

- K

>
All,

I have been re-thinking record management some. I want to share some of my thoughts here. I also want to see whether the version management requirements for Kevin's MVCC module could be satisfied by such a design.
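The page organization Bryan describes - slot map growing down from the top, dense data space growing up from the bottom, free space as the single gap between them - can be sketched as follows (page and per-slot sizes are assumed for illustration):

```java
/** Sketch of the proposed page layout: the slot map grows down like a stack,
 *  the data space grows up and is kept dense, so the free space is always
 *  one contiguous gap between the two regions. */
class PageLayout {
    static final int PAGE_SIZE = 8192;   // assumed page size
    static final int SLOT_SIZE = 4;      // assumed bytes per slot-map entry
    int slotCount = 0;                   // entries in the slot map (top of page)
    int dataBytes = 0;                   // dense data region (bottom of page)

    /** Free space is whatever lies between the data space and the slot map. */
    int freeSpace() {
        return PAGE_SIZE - dataBytes - slotCount * SLOT_SIZE;
    }

    /** Append a record on-page; returns its offset, or -1 if it won't fit
     *  (in which case the real design would indirect it to another page). */
    int append(int recordSize) {
        if (recordSize + SLOT_SIZE > freeSpace()) return -1; // would overlap the slot map
        int offset = dataBytes;          // dense: next record starts where data ends
        dataBytes += recordSize;
        slotCount++;
        return offset;
    }
}
```

Because the data space is dense, freeing a record implies compacting the region and updating the affected slots, which is the slot/data reordering Bryan says write operations may perform.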
The main insight is that the existing translation page design could be rethought as one possible extreme in a more flexible, and hopefully more efficient, design. The basic change is that each page would incorporate the capability to either store logical-to-physical translation slots or directly encode the physical record. The same slot map would be reused for both purposes, with appropriate coding to indicate an on-page record vs. a record that was indirected to another page.

Setting aside the page header for the minute, the design would have records growing up from the bottom of the page (lower addresses) and the slot map growing down from the top of the page. The lower bits of the OID would address slots in the page. Those slots would either give an on-page offset or the pageId and slot of the record on another page. The data space and the slot map space on the page would be kept dense, so there would never be free space between records. This should be a consistent data structure, and reconciliation of page images should be possible.

This design allows us to directly look up records by the OID without indirection. In the best case this means that we do one fetch for an object vs. two. If the record is on the page, then we are done. Otherwise we indirect, and the effect is exactly as with the existing translation pages. In the extreme, if all records are indirected, then the design is essentially the same as the existing translation page design.

I am still playing around with record management rules for the page. There are clearly many possible designs, and clustering remains an important issue.

Got to run.

-bryan
|