Sounds promising.  I think that there is a lot of room in these designs for optimizations and that the various twists may each benefit different applications.  I like the notion that you could do shadow record copying for MVCC with this scheme.  Is there any reason why subsuming the translation entries into the pages would not also work for that purpose or is it just that you would be constrained to not being able to collesce pages in the store?  For many applications, there will not be any need to collese pages in the store, which is why I think that is an interesting design to explore.
Also, an application that is forthright about naming access to records that are externally visible and has transparency for references (and reverse references) could compact a store using a copy and sweep mechanism very similar to java GC.
-----Original Message-----
From: []On Behalf Of Kevin Day
Sent: Wednesday, April 26, 2006 12:57 PM
To: JDBM Developer listserv
Subject: [Jdbm-developer] Thoughts on Bryan's proposal for allowing records to move inside a page...

Spent some time crunching on an idea that Bryan proposed a couple of weeks ago (having a table in each page that points to the starting location of records on the page).  Bryan's original idea had to do with getting rid of the translation system (or more accurately, subsuming the translation system into the data pages themselves).


At the time, I had a couple of hesitations about the idea (the biggest of which being that it would be impossible to move pages around).


After noodling on this a bit more, I'd like to propose an idea that gives us the advantages of what Bryan is proposing, and may help cache utilization and performance quite a bit...


The basic idea would be to maintain the transaltion table to map logical record ids to physical records.  However, instead of having the physical record id consist of a physical page and a physical offset on that page, we have the physical record id consist of the physical page, and a logical slot number for the record on the page.


The structure of a data page would be similar to what Bryan is proposing:  Data gets added from byte 0 onward, slot information gets added from PAGESIZE downward.  The slot information can consist of the information that would normally reside in the record header (length, and logical recid reverse lookup).  This arrangement will consume the same amount of disk space, but allows for a lot of re-organization of records inside a given page without having to update the translation table.


If we had this kind of setup, it would probably be possible to do away with the version table pages in my proof of concept, and instead do a version of shadow *record* copying (as opposed to shadow page copying).


If we have a normal (non VLR) transaction, the process for an update would be:


1.  Copy the before image of the RECORD to clean storage (this could be on the same page if there is room, or on a different page altogether)

2.  Lay a version table down in the slot of the before image

3.  Copy the after image of the record to clean storage

4.  Write dirty pages to the safe


some time would pass, after which the old version of the record is no longer needed (the new version becomes the 'earliest' version), at which point:


5.  Copy the after imageof the record to the location of the version table if possible

6.  Write dirty pages to the safe



Note that the translation table doesn't have to be updated at all for the above.  Also note that all of the page write operations are happening on pages in the cache/safe, and that the purging of the old version can occur in the context of the page management in the safe (i.e. a really good time to do version purges, etc... is during the purge of pages from the safe).


If the new version won't fit on the same page as the old version in the second operation, then the translation table would have to be updated, but that is to be expected.


One further optimization would be to have the ability (via a bit flag in the version table itself) to instruct jdbm's behavior in step 5 to allow for re-arrangement and clustering of records by the application.  If the flag is 0, we try to copy the after image to the location of the version table.  If the flag is 1, we leave the after image in place and update the translation table instead.


I see the advantages of the above approach as:


1.  Locality - the majority of the changes will occur on the same pages, and those pages will already be in cache.  This significantly alleviates the issue of churn in the translation sub-system.

2.  Coallescing of free space on a page is significantly simplified.  Each page will always have a chunk of free space in it's center.  An in-memory allocator can be used to determine which pages should be used for a given storage request.  In other words, we don't have to track free space per row - we track it per page.

3.  This should help concurrency a bit.  Because we are greatly reducing the interaction with the translation table, we'll be avoiding any synchronization constructs operating on that sub-system.

4.  Records can still be moved arbitrarily in the db file to optimize clustering, make translation pages available, pack and rebuild, etc...


Some outstanding questions that I haven't thought through all the way:


a)  What happens with records that span multiple pages?  I think this can be easily managed, but I want to make sure

b)  ???? others ???





- K

------------------------------------------------------- Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo _______________________________________________ Jdbm-developer mailing list