From: Thompson, B. B. <BRY...@sa...> - 2005-09-28 12:27:43
|
Kevin,

A while back I posted on the notion of "logical containers." I don't believe that I ever cleared up the terminology to your satisfaction. However, the recent thread on the translation pages has given me some insight into how to use such containers to consume those 10 remaining bits in the recid under the current translation page scheme. I'll sketch out the approach that I am suggesting. It has the advantage that it packs many logical records (from the application perspective) inside of a single physical record (from the jdbm perspective).

The basic idea is that any given physical record serves as a logical record containing some application data, but can also serve as a container for other logical records. The container behavior provides for up to 2^N objects in the physical record, where N is the #of bits available for addressing -- 10 in the case of the current translation page scheme as you have outlined the current jdbm behavior. A flag marks whether the physical record is currently a container or not. When dynamically promoted to a container, the following data are added to the record:

- #of logical records in the logical container (may span multiple physical containers)
- #of logical records in this physical record (up to 2^N)
- recid of the first secondary container (head of a linked list of zero or more physical records)
- recid of the last secondary container (tail of a linked list of zero or more physical records)
- long[] of keys into the container (maintained in sorted order)
- Object[] of values in the container (indexed into by finding the matching recid in the key[])
- keySerializer
- valueSerializer

As you can see, the physical record is dynamically transformed into something that is capable of looking up the recid in an internal index.
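A minimal sketch of the promoted container record described above, under stated assumptions: N = 10, the class and field names are mine for illustration (not jdbm code), and the "internal index" is a binary search over the sorted key[] holding the low-order bits of each recid.

```java
import java.util.Arrays;

// Hypothetical sketch of a physical record promoted to a logical container.
// N, the field names, and the lookup are illustrative assumptions, not jdbm code.
class LogicalContainer {
    static final int N = 10;                     // low-order recid bits used for addressing
    static final long SLOT_MASK = (1L << N) - 1;

    long logicalRecordCount;    // #of logical records across the whole logical container
    int  localRecordCount;      // #of logical records in this physical record (<= 2^N)
    long firstSecondaryRecid;   // head of the linked list of secondary containers
    long lastSecondaryRecid;    // tail of the linked list of secondary containers
    long[]   keys;              // packed (low-bit) recids, maintained in sorted order
    Object[] values;            // values[i] pairs with keys[i]

    LogicalContainer(long[] keys, Object[] values) {
        this.keys = keys;
        this.values = values;
        this.localRecordCount = keys.length;
        this.logicalRecordCount = keys.length;
    }

    /** High bits of the recid locate the translation-page slot for this physical record. */
    static long translationKey(long logicalRecid) {
        return logicalRecid >>> N;
    }

    /** The low N bits index into the container's internal sorted key[]. */
    Object fetch(long logicalRecid) {
        int i = Arrays.binarySearch(keys, logicalRecid & SLOT_MASK);
        return i >= 0 ? values[i] : null;
    }
}
```

Because the high-order bits of every recid in one physical record are identical, only the low N bits need to be stored in key[], which is what makes the packing cheap.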
The way this works is that the least significant N bits of the logical recid provided by the application are dropped in order to find the block and offset in the translation page that points to the physical record. Those least significant bits are then used to index into the container to find the logical record (so the key[] could be packed, since the high order bits are always the same).

The application inserts into a logical container using: insert( long containerHintId, Object obj ). The container hint is the recid of any object in that container. The serializer is the value serializer for the container (see above). When the address space for the physical record (those N bits) becomes full, a secondary container is created and objects are inserted into that container instead.

Note that the CRUD operators have basically the same cost as the original jdbm CRUD operators, since the only indirection is the translation page, exactly like current jdbm operations. The advantage is that many logical records may be packed into the same physical record, which has the same benefits that you noted with using a btree to pack records together. Multiple logical records in a single physical record means:

- fewer, but larger, I/O transfers
- fewer distinct physical records, so fewer objects in the cache.

I have implemented a variant on this as a RecordManager interface that delegates to the cache and then to the BaseRecordManager. This works very nicely. However, given the discussion of the translation pages, it is clear that this technique could be migrated directly into jdbm, which would give us full utilization of those unused address bits. So, when considering design alternatives for the translation pages, I think that this approach can be used to regain the full logical address space. The other consideration, of course, is changing the translation pages in order to support reordering of the pages within the store.

-bryan

-----Original Message-----
From: jdb...@li...
[mailto:jdb...@li...] On Behalf Of Kevin Day
Sent: Wednesday, September 28, 2005 2:12 AM
To: JDBM Developer listserv
Subject: [Jdbm-developer] More reverse lookup info

All-

Here are my thoughts for the evening - I'm looking forward to your responses and ideas...

Logical row id reverse lookup redux
--------------------------------------------------

Based on feedback from Markus, I went back and looked at the logical row ids. It turns out that the limit on the maximum logical row id has more to do with the 8 byte representation and how it maps to block/offset than it does with the amount of space actually available in the file. By taking advantage of this, and of the fact that all logical rowids are divisible by 10 (because of the slot size in the translation page), I can get the minimum bytes needed to capture a logical row id down to 6 (actually, 44 bits - which leaves a few bits for marking free and previous-free state to support coalescing). I'm pretty sure I've maxed it out at this point, though. Even with taking advantage of splitting records (to reduce the 4 byte size field to a 1 byte 'unused' field), I still can't get a record header smaller than 11 bytes. This is going to force us to either:

1. Maintain a separate table with reverse lookup, or
2. Change the binary format of the file (which will require some sort of conversion procedure)

I'm not keen on either, but I really don't see any alternative - hopefully someone else will?

Conversion / Off-line pack and rebuild
-----------------------------------------------------

I have some thoughts on a conversion procedure - I'm going to put them down here to see if anyone has any comments (note that this may make for an off-line pack and rebuild strategy as well):

1. Pull the length of the first logical row id's record
2. Allocate space from the available page list to store this record (probably using the used space of the record, not the available space), including the new header size (allocate new pages as necessary)
3. Insert that record at the new location
4. Update the translation page for the new record location, and insert the logical row id into the new record header
5. Mark the old record as free
6. Coalesce adjacent free records on either side of the old record, if possible
7. If the page containing the old record (or the result of coalescing the old record) now contains only free record data, move it to the free list
8. Update the value of the last logical row id to be processed (important for restarting without data corruption)
9. Every XX loops, call commit()
10. Continue with the next logical row id and repeat
11. After all logical row ids have been processed, reclaim any free pages that are now located at the end of the file

A couple of comments on this strategy:

First, we need to have some mechanism for pulling free pages that are as close to the front of the file as possible. We could order the list, or add parent, left and right pointers to the free page and implement a binary heap, but maybe there is some other mechanism that would be more efficient?

Second, we would probably want to also move other page types (free logical and physical row id pages come to mind - but I think we can get rid of those entirely, so maybe data pages are the only ones we need to worry about after all).

Third, the coalescing step is critical to this strategy working, as it will free up pages. Otherwise, we could wind up with a bunch of unused free pages at the end of the process.

Fourth, this strategy may work nicely as a general pack & rebuild operation (even if we aren't changing record header sizes).

Fifth, this may even work as an on-line pack & rebuild operation (we still need the reverse lookup to allow us to pack translation pages to the front of the file, though, so this doesn't remove that need - see my comments in the next section "Translation Page Strategy" for what I'm thinking here).
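The conversion loop (steps 1 through 11 above) can be sketched as a toy, in-memory simulation - with the caveat that the maps here merely stand in for the store file and the translation pages, coalescing and commit() are omitted, and none of the names are jdbm code:

```java
import java.util.*;

// Toy simulation of the off-line pack & rebuild loop (steps 1-5 and 11).
// TreeMap stands in for the store file, HashMap for the translation pages;
// these are illustrative assumptions, not jdbm structures.
class PackSimulation {
    final TreeMap<Long, byte[]> physical = new TreeMap<>(); // file offset -> record bytes
    final Map<Long, Long> translation = new HashMap<>();    // logical row id -> file offset
    long fileSize = 0;

    void insert(long logicalId, long atOffset, byte[] data) {
        physical.put(atOffset, data);
        translation.put(logicalId, atOffset);
        fileSize = Math.max(fileSize, atOffset + data.length);
    }

    byte[] fetch(long logicalId) {
        return physical.get(translation.get(logicalId));
    }

    /** Copy every record to the front of the "file", update its
     *  translation entry, then reclaim the trailing free space. */
    void pack() {
        TreeMap<Long, byte[]> packed = new TreeMap<>();
        long nextFree = 0;
        for (Map.Entry<Long, Long> e : translation.entrySet()) {
            byte[] data = physical.get(e.getValue()); // steps 1-2: read the old record
            packed.put(nextFree, data);               // step 3: insert at the new location
            e.setValue(nextFree);                     // step 4: update the translation entry
            nextFree += data.length;
        }
        physical.clear();                             // step 5: old locations are freed
        physical.putAll(packed);
        fileSize = nextFree;                          // step 11: reclaim trailing free space
    }
}
```

The real procedure would interleave commit() calls and coalesce free records in place rather than rebuilding wholesale, but the invariant is the same: after each relocation, the translation entry and the record's new location agree, so the process can be restarted safely.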
Basically, we could have a low priority thread that does one of the above loops, then goes to sleep. As long as the operation is synchronized properly, it should be possible to do this while the database is actually being used. The other approach would be to move through the loop with some duty cycle (i.e. every X inserts, run the loop), but doing it in a separate thread allows the pack to occur when the database isn't actively being used, so it won't affect performance.

Sixth, we would need to make sure that we don't move records around unnecessarily. I think this is just a matter of only moving records if they belong to blocks that have unused space in them (allowing for the special condition where there can be 16 bytes + 1 record header at the end of the page).

Is there anything else I might be missing?

Translation Page Strategy
------------------------------------

By the way (also based on Markus' comments), I went back and spent some time thinking about the new translation page strategy. I think that using a tree structure like I was proposing is *not* the way to go. Instead, I think we should adopt a policy of always placing new translation pages near the front of the file, and that we never move a translation page once it is allocated. Maintaining a single page read for translation is going to be important, and I think we can gain the same desired benefit (i.e. creating a system where the translation pages don't prevent us from packing free space out of the file) and still maintain O(1) lookups (with less disk space overhead as well). Any older files would continue to work as normal, but could never become smaller than the largest translation page allocated before the new code was put in place.

In order for this to work, we will need to know the physical location of the largest translation page (excluding pages added before this strategy change). When a new translation page is requested, we go to the physical page immediately following, and move its contents into a free page somewhere later in the file. We then mark the recently freed page as a translation page, link it to the correct page list, and start using it. There are some details about how to properly handle freed translation pages, but this strategy seems like it would work.

Basically, we recognize that we can't practically move translation pages around, so we make sure we place them initially where we want them to be. For this to work, though, we have to be able to move data pages around on-demand - and the only way that's going to happen is if we have a reverse lookup to find the logical row id for a given record. Stumped on how best to do this...

- K

-------------------------------------------------------
This SF.Net email is sponsored by: Power Architecture Resource Center: Free content, downloads, discussions, and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
Jdbm-developer mailing list
Jdb...@li...
https://lists.sourceforge.net/lists/listinfo/jdbm-developer |
From: Kevin D. <ke...@tr...> - 2005-09-29 05:20:55
|
Bryan-

Thanks for the description.

I'm still coming off a wicked cold, so my brain is a bit fuzzy - bear with me. As with most things, concrete examples help, so I'm going to try to put together a quick trace to help me understand:

logical row id = [12345][67] <-- I'm using the [][] notation to reflect values that are recovered by bit mask and shift operations - I don't want to have to think too hard here, so let's just say we have an arbitrary logical row id that allows us to yank 2 values from it.

You use [12345] to locate the first page of the container. This page basically contains an array of longs that act as page/offset values into the logical container. Are you then looking for the value 67 in the keys using a binary search, then taking that index and mapping to the object array?

OK - given that my understanding of the situation is as described above, here are my comments:

1. This approach requires that lots of objects be deserialized at the same time. This may or may not be a good thing, depending on the application's needs (more on that in a second - I think there may be some things about how the jdbm layers interact that haven't been made as clear as they should be).

2. If you happen to have an application that needs a data structure that meets the following conditions, then this is a good data structure:

a) Keys are longs (or some other primitive number - strings or more complicated objects that require custom comparators are out of the question), and you don't need to be able to specify the values of the keys explicitly at the application level (i.e. you couldn't key off of some value that was actually a long - such as a datestamp or anything like that).

b) There is a strong implicit, but not necessarily well defined, relationship between groups of stored objects.

c) You don't have to store very many "related" objects, or the objects are really, really small. In the current implementation, if a large number of objects are being stored, then looking up a single value will be very expensive in terms of disk reads. In the example you give (with the object array sized at 2^10), and assuming a (very) modest serialized data size per object of 32 bytes, you would wind up spanning 5 or 6 pages. That's a lot of disk reading, and a lot of deserialization, just to retrieve one object.

3. Now that I'm thinking about it, I'm pretty sure that your solution here is a degenerate BTree - basically, this logical container is a set of BPage leaves, and your addressing scheme is used to allow direct access to the page of interest without walking the tree. If you look at the BPage code, and remove all data members that are non-leaf related, I think you will find that the data structures are identical. Is there anything else here that I might be missing?

So, to summarize, the real advantage of this data structure is that it allows you to say "Store Object Y, and by the way, most of the time I'm going to use it, I'm going to also be using Object X that I've already stored". This is certainly very, very useful in some situations (although none come to mind at the moment - but that's more a reflection of the type of work I do...), and removes the requirement of figuring out how to allocate keys so they clump related objects together.

One thing I am unclear on, though, is how you handle the situation where I insert an object and say it is related to another object that is on a full page. Do you just not worry about it at that point? In a BTree, the tree would actually be adjusted to provide proximity... There's certainly no reason that you couldn't have a similar mechanism, but I think it would be very inefficient compared to the BTree approach of splitting pages and balancing the tree.

OK, enough with the comments, now for a quick discussion about jdbm IO...

In your description of the data structure, you said that one advantage of this is fewer (but larger) IO transfers. This is actually not the case.

One thing that hasn't been talked about a lot in all of our discussions about how jdbm works is the way IO occurs. I'm extrapolating a bit from your comment, so if I get it wrong, please gently correct me... When you read a record from the record manager, jdbm does not go to that offset on the disk and read the bytes for that record. jdbm *always* performs its disk IO one page at a time (2^13 bytes). If you are reading a record that starts on byte 2000, then it reads the entire page, then does the offset into the page and yanks the bytes out for deserialization.

This is the reason that I've been puzzled about your request for adding pre-fetch to the system. For all intents and purposes, there *is* a pre-fetch mechanism, or as close as you can come to one in a general purpose, random access IO system. It would be possible to do a little bit better here in terms of allowing the developer to specify a set of logical row ids to fetch, then do a gathering read (there's that NIO again!) to pull any pages that aren't in cache already.

This has a couple of implications that may have a big impact on the perceived benefits of the logical container data structure over storing individual records (this is not to say that there aren't advantages for certain data sets, but I think that the following will remove any concern you have about efficient disk access):

If you have a lot of small objects stored on a page (just to be consistent with my meanderings above, let's say that we have 32 byte objects, which will give you about 256 objects per page), then if you request the 90th object on the page, the entire page is read into memory, the bytes for just the 90th object are deserialized, and the object is returned. This is where the object cache comes into play. Any future request for that object will return it from the cache, so you never have to deserialize that object unless it is actually evicted from the cache (both L1 and L2).

Another point of interest is that there is a page cache. This means that if you then turn around and read the 95th object, all that is involved is a deserialization operation - no disk IO is required, because the page is already in the page cache. This, of course, makes the assumption that you actually *want* to read object 90 and 95 in close proximity to each other.

(As a side note - the size of the page cache is actually a high-watermark, which is one of the reasons that it was important to commit() even if you had transactions turned off - even after you commit() the transaction, the total number of pages used in that transaction will continue to be in memory.)

Some other comments: I'm actually not at all concerned about the "wasted" space in the current logical record id. The amount of addressable space in the file shrinks as you add more logical row ids (i.e. as the translation table grows). This means that you'd have to have a file filled with 8 byte objects in order to run out of logical row ids before you run out of addressable file space (to say nothing of the fact that you'd have run out of physical drive space eons before that - as a fun exercise, calculate how long it would take to write 2^63 bytes at modern disk bus speeds... it's 1 billion years... the sun would not quite have actually consumed the earth, but I think you get the picture :-) )

I was actually quite happy that the wasted space was there, because it meant I could be tricky and shrink the reverse lookup value in the record header so we don't clobber disk space in order to have reverse lookup.

Enough for now - lots of stuff to think about.

- K

> Kevin,
>
> A while back I posted on the notion of "logical containers." I don't believe that I ever cleared up the
> terminology to your satisfaction. However, the recent thread on the translation pages has given me
> some insight as to how to use such containers so as to consume those 10 remaining bits in the
> recid in the current translation page scheme. I'll sketch out the approach that I am suggesting. It
> has the advantage that it packs many logical records (from the application perspective) inside of a
> single physical record (from the jdbm perspective).
> <snip> |
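Kevin's point about page-granular IO comes down to a little arithmetic. The 2^13-byte page size is from his message; the helper names below are mine, not jdbm APIs:

```java
// Illustrates page-granular IO arithmetic as described above: every read
// fetches a whole page, and the record is sliced out by its in-page offset.
// PAGE_SIZE (2^13) is from Kevin's message; the helper names are hypothetical.
class PageMath {
    static final int PAGE_SIZE = 1 << 13;   // 8192 bytes per page

    /** Which page a given file offset falls on. */
    static long pageOf(long fileOffset) {
        return fileOffset / PAGE_SIZE;
    }

    /** Offset of the record within its page. */
    static int offsetInPage(long fileOffset) {
        return (int) (fileOffset % PAGE_SIZE);
    }
}
```

So a record starting at byte 2000 lives on page 0 at in-page offset 2000, and the whole 8192-byte page is read to get at it. With 32-byte objects, 8192 / 32 = 256 objects share one page read; a 2^10-entry container of such objects is 32768 bytes, i.e. 4 pages of data before record headers and keys are counted, consistent with the 5 or 6 pages Kevin estimates.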