From: Thompson, B. B. <BRY...@sa...> - 2006-04-11 12:14:12
|
I am wondering if we could define the record management API in terms of = an OID interface and then provide for delayed conversion of OIDs to 64-bit integers. After insert but before conversion the OID would "float". = The conversion would happen when the object was evicted from cache or the transaction commits. At this point the OID would be assigned and = locked for the object. I think that this would make it possible to cluster OIDs = based on the "mature" state of the object. Internally, the OID object would maintain a long integer. We need to differentiate an OID which is NULL (0L) from one which is floating = (-1L?). All caching operations for records would be defined in terms of OIDs. I would suggest a factory for OIDs. Different stores might choose = different pages sizes, which would mean a different #of bits reserved for the = slot# on the page vs the pageId. That context would have to come from a factory associated with the store. -bryan -----Original Message----- From: jdb...@li... [mailto:jdb...@li...] On Behalf Of = Thompson, Bryan B. Sent: Tuesday, April 11, 2006 6:32 AM To: Kevin Day; JDBM Developer listserv Subject: RE: [Jdbm-developer] record managment Kevin, Comments below. -bryan ________________________________________ From: jdb...@li... [mailto:jdb...@li...] On Behalf Of Kevin = Day Sent: Monday, April 10, 2006 9:36 PM To: JDBM Developer listserv Subject: re: [Jdbm-developer] record managment Bryan- =A0 Interesting - this is similar to something that I saw in the (cricket?) architecture for a threaded log based db. =A0 A couple of questions/comments that may spark some discussion points or = new directions:=A0=20 =A0 1.=A0 It seems that the proposed idea would work well until records are updated in a manner that increases their size.=A0 At that point, it = seems unlikely to me that all of the records on a given page would remain on = that page, so you'd have some entries that would overflow into other pages. BBT: Yes, the notional design is that each page has a slot map and a = data space. The slot map records the offset of the physical record on the = page. If there is an overflow situation, then the slot map operates as a translation slot and records the _slot_ of the record on a _different_ = page. At most one indirection could be performed. If all slots on a page are indirected, then this is the same as the current translation page = solution. =A0 2.=A0 I'm pretty sure that this would preclude the ability of ever = packing and shrinking the database...=A0 If I allocate tons of objects, then delete = all but one on each page, I can't shrink the database or otherwise optimize = it in any way.=A0 I have been envisioning a system that always places = translation pages at the front of the db file, even if it means moving data pages = out of the way during an allocation.=A0 In that way, the db can always be = shrunk down to the size of the number of translation pages that contain active = slots. BBT: This design could degenerate into the translation page design (per above), but it is oriented towards reducing lookup for objects over compaction of the store. =A0 3.=A0 This design does allow for some form of space optimization = without incuring the cost of updating the translation table itself - which definitely has bennefits. =A0 4.=A0 It seems that this would reduce the efficiency of the translation = paging cache considerably - instead of having ~2000 slots on a page, we'd = probably wind up with ~20-50 (obviously, it depends on the size of the data = record).=A0 In the multi-version scenario, if a version table has to be written to = disk at all (and in the vast majority of cases it will not), this *might* = force a larger number of page writes to the safe than is strictly necessary.=A0 = Given the non-locality of recids, this may or may not actually be a = consideration, though. BBT: I do not see how it could reduce the efficiency of the translation = page caching. I think that the better analogy is that translation pages are = an overhead that we can avoid in many scenarios, but that the design can degrade to the same performance as the translation page design. Also, = I am very much thinking in terms of clustering support with this design. =A0 5.=A0 One way of addressing my concern with #2 (which is along the = lines of the cricket architecture)=A0is to use a b-tree to manage the offsets of = the pages themselves.=A0 Each node in the BTree would represent the = starting point of a contiguous block of pages - so if you have 50 pages on disk that = are ordered by their logical ordering, it would require a single BTree = node.=A0 This allows pages to be moved to arbitrary locations (and, in fact, = sets you up for infinite stores - you just have occasional checkpoints that = include an entire BTree). =A0 That, of course, re-introduces a lookup... =A0 6.=A0 Locality take a bit of a hit in terms of what is easy to = implement vs not=A0- if an application wants to request that the recman place a set = of objects near each other, it can only do so (efficiently) in the context = of the initial inserts of the objects (and only then if the allocator = assigns recids that are near each other). BBT: Yes, this is an interesting issue. It seems that the more = information available to drive a clustering strategy the better. When an object is first created the most information that you might expect to have on = hand is some notion of a "container". If you are able to defer until the = object has been linked, then you can leverage the links and metadata about link = classes to make clustering decisions. I see several ways of handling this: (1) = use the translation page mode of the design and defer the placement of the physical records onto pages until more information is available. This = would support clustering of physical records, but not of the OIDs. The = existing jdbm defers serialization and writing onto the page image in an = appropriate manner but does not use a clustering strategy to guide the choice of = pages onto which to place the physical records. (2) use a policy in the application framework that supports persistence by reachability such = that the OID assignment may be deferred. In this case, the OIDs could be clustered as well based. (3) use the notion of a "container" for = clustering OIDs. Objects inserted into a container will be clustered together. BBT: This highlights a possible problem with the use of translation = pages. If the OIDs are not clustered, then a large store may demonstrate = thrashing in the translation pages since the OIDs themselves do not demonstrate locality of reference. Clustering at persistent objects at insert = appears to be required to minimize disk reads since otherwise reads of = translation pages may be scattered. Here are some sample numbers. #of objects in store: 10M average size of serialized object: 80 bytes. per object overhead: 8 bytes (record header) vs 11 bytes (slot size) adjusted average object size: 88 vs 91 bytes #of translation pages (jdbm 1.x): 12,237 average #of objects per data page: 93 vs 90 #of data pages: 107,422 vs 111,084 Total #of pages: 119,659 vs 111,084 Now I am not convinced that the translation page design is bad. It = does introduce an indirection, but if there is reasonable locality in the = OIDs then the translation pages will be cached. If we do not have locality = in the OIDs, then the new design will still have to do a lot of page reads = for the physical records since they will not be effectively clustered. The translation table design appears to be marginally more space = efficient, but this is entirely a function of the difference in the per-object = overhead (3 bytes), so little differences get magnified by the #of objects. Kevin wrote: =A0 Of all of the above, the only one that I'm actually really concerned = about is #2 - the rest can be addressed with fancy bookkeeping (a pain in the = neck to develop, but it should be doable).=A0 I suspect that #5 is not going = to be palatable, because it kind of defeats the purpose of doing this in the = first place. =A0 BBT: I admit that I am less concerned with compaction of the store and = more concerned with read performance and with the overhead associated with managing free space in the store. This is still the largest hotspot in = jdbm 1.x and takes 10% of the runtime for our application. I think that the new design will provide faster management of free = space on pages. The basic principle is to divide each page into a slot map (allocated like a stack - top down) and a data space (growing bottom = up). The data space would be dense. The free space would exist between the = end of the data space and the bottom of the slot map. Operations on the = page would be atomic (thread safe). Write operations could reorder the data space and the slot map. Write operations could also force a physical = record to become indirected (translated to another page) or to become direct = (on page). -bryan =A0 - K =A0 =A0 =A0=20 >=20 All,=20 =A0 I have been re-thinking record management some. =A0I want to share some = of my thoughts here. =A0I also want to see whether the version management requirements for kevins MVCC module could be satisfied by such a = design.=20 =A0 The main insight is that the existing translation page design could be rethought as one possible extreme in a more flexible, and hopefully = more efficient, design. =A0The basic change is that each page would = incorporate the capability to store either and/or logical to physical translation slots = or directly encode the physical record. =A0The same slot map would be = reused for both purposes with appropriate coding to indicate an on-page record vs. = a record that was indirected to another page.=20 =A0 Setting aside the page header for the minute, the design would have = records growing up from the bottom of the page (lower address) and the slot map growing down from the top of the page. =A0The lower bits of the OID = would address slots in the page. =A0Those slots would either give an on-page = offset or the pageId and slot off the record on another page. =A0The data = space and the slot map space on the page would be kept dense, so there would = never be free space between records. =A0This should be aconsistent data = structure and reconciliation of page images should be possible.=20 =A0 This design allows us to directly lookup records by the OID without indirection. =A0In the best case this means that we do one fetch for an = object vs two. =A0If the record is one page, then we are done. =A0Otherwise we = indirect and the effect is exactly as with the existing translation pages. =A0In = the extreme, if all records are indirected then the design is essentially = the same as the existing translation page design.=20 =A0 I am still playing around with record management rules for the page and there are clearly many designs that are possible and clustering remains = an important issue.=20 =A0 Got to run. =A0 =A0 -bryan < ------------------------------------------------------- This SF.Net email is sponsored by xPML, a groundbreaking scripting = language that extends applications into web and mobile media. Attend the live = webcast and join the prime developer group breaking into this new coding = territory! http://sel.as-us.falkag.net/sel?cmd=3Dk&kid=110944&bid$1720&dat=121642 _______________________________________________ Jdbm-developer mailing list Jdb...@li... https://lists.sourceforge.net/lists/listinfo/jdbm-developer |