From: Thompson, B. B. <BRY...@sa...> - 2006-04-11 10:33:38
|
Kevin,

Comments below.

-bryan

________________________________________
From: jdb...@li... [mailto:jdb...@li...] On Behalf Of Kevin Day
Sent: Monday, April 10, 2006 9:36 PM
To: JDBM Developer listserv
Subject: re: [Jdbm-developer] record managment

Bryan-

Interesting - this is similar to something that I saw in the (cricket?) architecture for a threaded log-based db.

A couple of questions/comments that may spark some discussion points or new directions:

1. It seems that the proposed idea would work well until records are updated in a manner that increases their size. At that point, it seems unlikely to me that all of the records on a given page would remain on that page, so you'd have some entries that would overflow into other pages.

BBT: Yes, the notional design is that each page has a slot map and a data space. The slot map records the offset of the physical record on the page. If there is an overflow situation, then the slot map operates as a translation slot and records the _slot_ of the record on a _different_ page. At most one indirection could be performed. If all slots on a page are indirected, then this is the same as the current translation page solution.

2. I'm pretty sure that this would preclude the ability of ever packing and shrinking the database... If I allocate tons of objects, then delete all but one on each page, I can't shrink the database or otherwise optimize it in any way. I have been envisioning a system that always places translation pages at the front of the db file, even if it means moving data pages out of the way during an allocation. In that way, the db can always be shrunk down to the size of the number of translation pages that contain active slots.

BBT: This design could degenerate into the translation page design (per above), but it is oriented towards reducing the lookup cost for objects rather than supporting compaction of the store.
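The one-hop indirection Bryan describes can be sketched as a tagged slot entry: each slot in a page's slot map holds either the on-page byte offset of the record, or a (pageId, slot#) pair naming where the record actually lives, and lookup follows that redirection at most once. The flag bit and field widths below are illustrative assumptions, not jdbm's actual on-disk format.

```java
// Sketch only: one possible encoding of a slot-map entry.
// High bit set  -> indirected slot: (pageId, slot#) on another page.
// High bit clear -> direct slot: on-page byte offset of the record.
public class SlotEntry {
    private static final long INDIRECT = 1L << 63;

    static long direct(int offset) {
        return offset;                       // flag bit clear
    }

    static long indirect(long pageId, int slot) {
        // Pack pageId above a (hypothetical) 16-bit slot number.
        return INDIRECT | (pageId << 16) | (slot & 0xFFFF);
    }

    static boolean isIndirect(long entry) {
        return (entry & INDIRECT) != 0;
    }

    static int offset(long entry)  { return (int) entry; }
    static long pageId(long entry) { return (entry & ~INDIRECT) >>> 16; }
    static int slot(long entry)    { return (int) (entry & 0xFFFF); }
}
```

A fetch would decode the slot addressed by the low bits of the OID; if `isIndirect` is false the record is read from the same page, otherwise one more page read resolves it — which is why the worst case degenerates to exactly the translation-page behavior.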
3. This design does allow for some form of space optimization without incurring the cost of updating the translation table itself - which definitely has benefits.

4. It seems that this would reduce the efficiency of the translation paging cache considerably - instead of having ~2000 slots on a page, we'd probably wind up with ~20-50 (obviously, it depends on the size of the data record). In the multi-version scenario, if a version table has to be written to disk at all (and in the vast majority of cases it will not), this *might* force a larger number of page writes to the safe than is strictly necessary. Given the non-locality of recids, this may or may not actually be a consideration, though.

BBT: I do not see how it could reduce the efficiency of the translation page caching. I think that the better analogy is that translation pages are an overhead that we can avoid in many scenarios, but that the design can degrade to the same performance as the translation page design. Also, I am very much thinking in terms of clustering support with this design.

5. One way of addressing my concern with #2 (which is along the lines of the cricket architecture) is to use a b-tree to manage the offsets of the pages themselves. Each node in the BTree would represent the starting point of a contiguous block of pages - so if you have 50 pages on disk that are ordered by their logical ordering, it would require a single BTree node. This allows pages to be moved to arbitrary locations (and, in fact, sets you up for infinite stores - you just have occasional checkpoints that include an entire BTree).

That, of course, re-introduces a lookup...
6. Locality takes a bit of a hit in terms of what is easy to implement vs not - if an application wants to request that the recman place a set of objects near each other, it can only do so (efficiently) in the context of the initial inserts of the objects (and only then if the allocator assigns recids that are near each other).

BBT: Yes, this is an interesting issue. It seems that the more information available to drive a clustering strategy, the better. When an object is first created, the most information that you might expect to have on hand is some notion of a "container". If you are able to defer until the object has been linked, then you can leverage the links and metadata about link classes to make clustering decisions. I see several ways of handling this: (1) use the translation page mode of the design and defer the placement of the physical records onto pages until more information is available. This would support clustering of physical records, but not of the OIDs. The existing jdbm defers serialization and writing onto the page image in an appropriate manner, but does not use a clustering strategy to guide the choice of pages onto which to place the physical records. (2) use a policy in the application framework that supports persistence by reachability such that the OID assignment may be deferred. In this case, the OIDs could be clustered as well. (3) use the notion of a "container" for clustering OIDs. Objects inserted into a container will be clustered together.

BBT: This highlights a possible problem with the use of translation pages. If the OIDs are not clustered, then a large store may thrash in the translation pages, since the OIDs themselves do not demonstrate locality of reference. Clustering of persistent objects at insert appears to be required to minimize disk reads, since otherwise reads of translation pages may be scattered. Here are some sample numbers.
#of objects in store: 10M
average size of serialized object: 80 bytes
per-object overhead: 8 bytes (record header) vs 11 bytes (slot size)
adjusted average object size: 88 vs 91 bytes
#of translation pages (jdbm 1.x): 12,237
average #of objects per data page: 93 vs 90
#of data pages: 107,422 vs 111,084
Total #of pages: 119,659 vs 111,084

Now I am not convinced that the translation page design is bad. It does introduce an indirection, but if there is reasonable locality in the OIDs then the translation pages will be cached. If we do not have locality in the OIDs, then the new design will still have to do a lot of page reads for the physical records, since they will not be effectively clustered. The translation table design appears to be marginally more space efficient, but this is entirely a function of the difference in the per-object overhead (3 bytes), so little differences get magnified by the #of objects.

Kevin wrote:

Of all of the above, the only one that I'm actually really concerned about is #2 - the rest can be addressed with fancy bookkeeping (a pain in the neck to develop, but it should be doable). I suspect that #5 is not going to be palatable, because it kind of defeats the purpose of doing this in the first place.

BBT: I admit that I am less concerned with compaction of the store and more concerned with read performance and with the overhead associated with managing free space in the store. This is still the largest hotspot in jdbm 1.x and takes 10% of the runtime for our application.

I think that the new design will provide faster management of free space on pages. The basic principle is to divide each page into a slot map (allocated like a stack - top down) and a data space (growing bottom up). The data space would be dense. The free space would exist between the end of the data space and the bottom of the slot map. Operations on the page would be atomic (thread safe).
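As a side note, the data-page figures above are consistent with 8 KiB pages: ceil(10M × 88 / 8192) = 107,422 and ceil(10M × 91 / 8192) = 111,084, with the 12,237 translation pages taken as given. A quick check (the class and constants here are mine, for illustration only):

```java
// Sanity-check of the sample numbers, assuming 8 KiB pages.
public class PageMath {
    static final long PAGE = 8192;              // assumed page size in bytes
    static final long OBJECTS = 10_000_000L;    // 10M objects in the store

    // Pages needed for a dense packing at the given per-object size.
    static long dataPages(long perObjectBytes) {
        long totalBytes = OBJECTS * perObjectBytes;
        return (totalBytes + PAGE - 1) / PAGE;  // ceiling division
    }
}
```

With 88 bytes/object (translation design) this gives 107,422 data pages, plus the 12,237 translation pages for a total of 119,659; with 91 bytes/object (slot design) it gives 111,084 pages total, matching the table.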
Write operations could reorder the data space and the slot map. Write operations could also force a physical record to become indirected (translated to another page) or to become direct (on-page).

-bryan

- K

>
All,

I have been re-thinking record management some. I want to share some of my thoughts here. I also want to see whether the version management requirements for Kevin's MVCC module could be satisfied by such a design.

The main insight is that the existing translation page design could be rethought as one possible extreme in a more flexible, and hopefully more efficient, design. The basic change is that each page would incorporate the capability to either store logical-to-physical translation slots or directly encode the physical record. The same slot map would be reused for both purposes, with appropriate coding to indicate an on-page record vs. a record that was indirected to another page.

Setting aside the page header for the minute, the design would have records growing up from the bottom of the page (lower address) and the slot map growing down from the top of the page. The lower bits of the OID would address slots in the page. Those slots would either give an on-page offset or the pageId and slot of the record on another page. The data space and the slot map space on the page would be kept dense, so there would never be free space between records. This should be a consistent data structure, and reconciliation of page images should be possible.

This design allows us to directly look up records by the OID without indirection. In the best case this means that we do one fetch for an object vs two. If the record is on-page, then we are done. Otherwise we indirect, and the effect is exactly as with the existing translation pages.
In the extreme, if all records are indirected, then the design is essentially the same as the existing translation page design.

I am still playing around with record management rules for the page, and there are clearly many designs that are possible; clustering remains an important issue.

Got to run.

-bryan
<
|
From: Thompson, B. B. <BRY...@sa...> - 2006-04-11 12:14:12
|
I am wondering if we could define the record management API in terms of an OID interface and then provide for delayed conversion of OIDs to 64-bit integers. After insert but before conversion the OID would "float". The conversion would happen when the object was evicted from cache or the transaction commits. At this point the OID would be assigned and locked for the object. I think that this would make it possible to cluster OIDs based on the "mature" state of the object.

Internally, the OID object would maintain a long integer. We need to differentiate an OID which is NULL (0L) from one which is floating (-1L?). All caching operations for records would be defined in terms of OIDs.

I would suggest a factory for OIDs. Different stores might choose different page sizes, which would mean a different #of bits reserved for the slot# on the page vs the pageId. That context would have to come from a factory associated with the store.

-bryan

-----Original Message-----
From: jdb...@li... [mailto:jdb...@li...] On Behalf Of Thompson, Bryan B.
Sent: Tuesday, April 11, 2006 6:32 AM
To: Kevin Day; JDBM Developer listserv
Subject: RE: [Jdbm-developer] record managment

Kevin,

Comments below.

-bryan
<
|
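Bryan's floating-OID proposal above might look like the sketch below. The class, the `slotBits` split, and the method names are all invented for illustration; the real context (page size, hence bit widths) would come from the store's OID factory, as he suggests.

```java
// Sketch of a "floating" OID: unassigned until cache eviction or commit,
// then locked to a physical (pageId, slot#) packed into a long.
public class Oid {
    static final long NULL_OID = 0L;   // distinguished NULL, per the proposal
    static final long FLOATING = -1L;  // not yet assigned

    private long value = FLOATING;
    private final int slotBits;        // #of bits for the slot#, store-specific

    Oid(int slotBits) { this.slotBits = slotBits; }

    boolean isFloating() { return value == FLOATING; }

    // Called on eviction or commit: fixes the OID once and for all.
    void assign(long pageId, int slot) {
        if (!isFloating()) throw new IllegalStateException("already assigned");
        value = (pageId << slotBits) | slot;
    }

    long pageId() { return value >>> slotBits; }
    int slot()    { return (int) (value & ((1L << slotBits) - 1)); }
}
```

Deferring `assign` until the object's "mature" state is known is what would let the store pick a pageId that clusters the OID with its neighbors.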
From: Kevin D. <ke...@tr...> - 2006-04-11 15:56:16
|
Bryan-

Given that the insert() operation returns a recid that is then used extensively by other objects in the store, how do you see this working? Are you talking about not having a long recid that is returned for object references, but instead having some sort of 'reference' object? I do this in my applications, but you have to handle serialization specially for the reference objects...

- K

> I am wondering if we could define the record management API in terms of an OID interface and then provide for delayed conversion of OIDs to 64-bit integers. After insert but before conversion the OID would "float". <
|
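One hypothetical way to handle the special serialization Kevin mentions is a reference object that serializes down to the bare long recid, so stored objects never embed the wrapper itself. This is an illustrative sketch, not jdbm API; a floating OID would have to be forced to a final recid before it is written.

```java
import java.io.*;

// Sketch: a 'reference' object standing in for a raw long recid.
public class RecordRef implements Externalizable {
    private long recid;

    public RecordRef() {}                       // required by Externalizable
    public RecordRef(long recid) { this.recid = recid; }
    public long recid() { return recid; }

    @Override public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(recid);                   // only the recid hits the page
    }

    @Override public void readExternal(ObjectInput in) throws IOException {
        recid = in.readLong();
    }

    // Helper demonstrating that a reference survives a serialization cycle.
    static long roundTrip(long recid) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(new RecordRef(recid));
            oos.flush();
            ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return ((RecordRef) ois.readObject()).recid();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```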
From: Kevin D. <ke...@tr...> - 2006-04-11 21:44:21
|
Bryan-

> I certainly agree that writing to a clean page is a dead simple strategy when you are doing a bunch of inserts and makes it especially easy to cluster objects since there is only one tx doing the allocation and with data on the page. <

KD - What is unknown is whether this strategy is any worse than any other strategy. It's so simple that it seems like it almost has to be worse - but I can't really put my finger on exactly how it would be worse. I'm wondering if there is any research in the literature that critiques this approach.... ?

One improvement to this strategy would be to have a bit of interaction with the page cache. We evict pages in chunks, and perform a GC run across all the pages in a given chunk. If GC actually results in reclaimed space, the changed pages get pushed back into the safe at the top of the cache.

Tuning of the cache then becomes (in part) choosing a size such that page evictions result in entire pages being removed from the write set.

Like I said, this may be a complete loser approach - but all of that extra GC, etc... is occurring in a background process, so maybe not... My biggest issue is that it requires multiple writes of the pages to the safe... And when we add translation pages to the mix, a single optimization of a given data page could result in a large number of pages having to be re-saved to cache.

An optimization would be to do what you are proposing and have an allocator strategy that only tracks open record ranges in cached pages (with slots on dirty pages being given higher priority).

Certainly, having the translation info and data on the same page would fix some of this. How about the following as a possible compromise that may give the best of both worlds:

Maintain a PAGE translation system at the beginning of the db file, but keep record location inside the page itself with the data. A lookup would still consist of a translation lookup to determine the physical page, but the position of a given record on its page could change dynamically. This would allow us to quickly move pages around without having to adjust potentially scattered OIDs, and allow us to quickly coalesce freed records to maximize available space on a page.

I'm still noodling on this idea - I think there is something here - I'm just not sure about the details.

- K

>
Kevin,

I was thinking that the pages in the cache would be those available for the record allocator. You simply notice how much free space is on each page in the page cache as it comes in, and build up an access structure to find a page with sufficient free space for an allocation quickly.

I certainly agree that writing to a clean page is a dead simple strategy when you are doing a bunch of inserts and makes it especially easy to cluster objects since there is only one tx doing the allocation and with data on the page.

-bryan

From: jdb...@li... [mailto:jdb...@li...] On Behalf Of Kevin Day
Sent: Tuesday, April 11, 2006 3:46 PM
To: JDBM Developer listserv
Subject: re[4]: [Jdbm-developer] record managment

Bryan-

Ahh - I see the concern now.

I don't think this really resolves the allocator issue - you will still wind up having to keep track of which pages have how much space left on them. Imagine how the system will look after running for a long period of time - basically, you will have a bunch of pages, all with different sized dead zones in the middle. You don't want to have to scan each page looking for sufficient available space, so we are back into the allocator discussion.

Some other options to consider:

1. Using a Fibonacci series index for starting positions of linked lists containing free records

or

2. Always write to clean pages, and use a low-priority optimization operation to move records from partially used pages to the wilderness zone - this has the advantage of completely removing the allocator from the discussion. Disadvantages would be that the db file could grow quite a bit during heavy load, then shrink back during the optimization. I think that this would be the best in terms of performance - a given transaction's update set would be basically written to disk in an entirely linear fashion, and it maximizes cache utilization.

Tracking free slots is easy - we just add a pointer to each page with the first available slot on the page, and another to the next page that contains available slots. Slots would be assigned until the page was full, then it would move on to the next page.

I certainly see the need for a mechanism for allowing the application to force the db to group a set of records - I just think that this can be done in the context of arbitrary re-locations of records, using the translation system....

- K
<
|
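The free-slot bookkeeping Kevin describes - each page carrying the first available slot on it plus a link to the next page with available slots - is essentially an intrusive free list over pages. A minimal sketch, with all names and the one-slot-per-page simplification invented here:

```java
import java.util.*;

// Sketch: pages chained into a free list via their headers.
public class FreeSlotList {
    static final int NONE = -1;

    static class PageHeader {
        int firstFreeSlot = NONE;  // first available slot on this page
        int nextFreePage = NONE;   // next page that has available slots
    }

    final List<PageHeader> pages = new ArrayList<>();
    int firstFreePage = NONE;      // head of the free-page chain

    // Register a page that has a free slot; it is pushed onto the chain.
    int addPage(int freeSlot) {
        PageHeader h = new PageHeader();
        h.firstFreeSlot = freeSlot;
        h.nextFreePage = firstFreePage;
        pages.add(h);
        firstFreePage = pages.size() - 1;
        return firstFreePage;
    }

    // Allocate from the head page; unlink it once it has no free slots.
    // Returns {pageId, slot}, or null if no page has space.
    int[] allocate() {
        if (firstFreePage == NONE) return null;
        PageHeader h = pages.get(firstFreePage);
        int[] loc = { firstFreePage, h.firstFreeSlot };
        h.firstFreeSlot = NONE;            // sketch tracks one slot per page
        firstFreePage = h.nextFreePage;    // move on to the next page
        return loc;
    }
}
```

Allocation never scans a page for space - it follows two pointers - which is the property Kevin is after.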
From: Kevin D. <ke...@tr...> - 2006-04-11 18:02:38
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML><HEAD> <STYLE type=text/css> P, UL, OL, DL, DIR, MENU, PRE { margin: 0 auto;}</STYLE> <META content="MSHTML 6.00.2900.2802" name=GENERATOR></HEAD> <BODY leftMargin=1 topMargin=1 rightMargin=1><FONT face=Tahoma> <DIV><FONT face=Arial size=2>Bryan-</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>The numbers you provide are definitely instructive... If you attach a profiler to your application, what percentage of the processing time is being spent in lookup? I suspect that inserts are where the majority of the performance hit is coming from, so I don't want to get too far into talking about ways of tweaking the last bit of performance out of reads, etc...</FONT></DIV> <DIV><FONT face=Arial size=2></FONT> </DIV> <DIV><FONT face=Arial size=2>My general opinion here is that if we want OIDs to have locality, that we do it by providing batch insert operations on the record manager (i.e. the application could call insert(Object[]) and get back a list of contiguous OIDs - the translation page allocator would then do whatever it had to to get contiguous blocks. I'm not sure that a complete re-think of the dedicated translation page thread is necessary to achive this goal. Once you have OIDs all on a page, then you can have the record allocator move records around as the application sees fit to obtain locality. 
Reading of all of the objects with 'locality' to each other would involve two page reads.

I think that doing a lot of work to reduce two page reads to one may not really gain us much in terms of overall system improvement.

The good news is that the proposed MVCC architecture should be completely independent of the paging system, so we will be free to play with different choices if we see fit.

I do want to explore some of the potential hotspots in the proof of concept and see if you have any thoughts on object-to-page cache interaction - I'll write that in a separate email.

- K

> Kevin,

Comments below.

-bryan

________________________________________
From: jdb...@li... [mailto:jdb...@li...] On Behalf Of Kevin Day
Sent: Monday, April 10, 2006 9:36 PM
To: JDBM Developer listserv
Subject: re: [Jdbm-developer] record managment

Bryan-

Interesting - this is similar to something that I saw in the (cricket?) architecture for a threaded log based db.

A couple of questions/comments that may spark some discussion points or new directions:

1.
It seems that the proposed idea would work well until records are updated in a manner that increases their size. At that point, it seems unlikely to me that all of the records on a given page would remain on that page, so you'd have some entries that would overflow into other pages.

BBT: Yes, the notional design is that each page has a slot map and a data space. The slot map records the offset of the physical record on the page. If there is an overflow situation, then the slot map operates as a translation slot and records the _slot_ of the record on a _different_ page. At most one indirection could be performed. If all slots on a page are indirected, then this is the same as the current translation page solution.

2. I'm pretty sure that this would preclude the ability of ever packing and shrinking the database... If I allocate tons of objects, then delete all but one on each page, I can't shrink the database or otherwise optimize it in any way. I have been envisioning a system that always places translation pages at the front of the db file, even if it means moving data pages out of the way during an allocation. In that way, the db can always be shrunk down to the size of the number of translation pages that contain active slots.

BBT: This design could degenerate into the translation page design (per above), but it is oriented towards reducing lookups for objects rather than towards compaction of the store.

3. This design does allow for some form of space optimization without incurring the cost of updating the translation table itself - which definitely has benefits.

4. It seems that this would reduce the efficiency of the translation page cache considerably - instead of having ~2000 slots on a page, we'd probably wind up with ~20-50 (obviously, it depends on the size of the data record).
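Bryan's slot-map scheme from point 1 - a slot that either holds an on-page offset or one level of indirection to a slot on another page - can be sketched as follows (names and the encoding are hypothetical; a real implementation would pack this into the on-page slot bytes):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the slot map: each slot is either DIRECT (record bytes are on
 *  this page at `offset`) or INDIRECT (record lives at targetPage/targetSlot).
 *  Per the design, at most one indirection is ever followed. */
class SlotMapSketch {
    static final int DIRECT = 0, INDIRECT = 1;

    static class Slot {
        int kind;         // DIRECT or INDIRECT
        int offset;       // valid when DIRECT: byte offset of record on this page
        long targetPage;  // valid when INDIRECT
        int targetSlot;
    }

    final Map<Long, Slot[]> pages = new HashMap<>();

    /** Resolve an OID's (pageId, slot) to the {pageId, offset} holding the bytes. */
    long[] resolve(long pageId, int slot) {
        Slot s = pages.get(pageId)[slot];
        if (s.kind == DIRECT) return new long[]{pageId, s.offset};   // best case: one page read
        Slot t = pages.get(s.targetPage)[s.targetSlot];
        assert t.kind == DIRECT : "at most one indirection allowed";
        return new long[]{s.targetPage, t.offset};                   // overflow case: two reads
    }
}
```

In the extreme where every slot is INDIRECT, the page behaves exactly like one of today's translation pages, which is the degenerate case Bryan mentions above.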
In the multi-version scenario, if a version table has to be written to disk at all (and in the vast majority of cases it will not), this *might* force a larger number of page writes to the safe than is strictly necessary. Given the non-locality of recids, this may or may not actually be a consideration, though.

BBT: I do not see how it could reduce the efficiency of the translation page caching. I think that the better analogy is that translation pages are an overhead that we can avoid in many scenarios, but that the design can degrade to the same performance as the translation page design. Also, I am very much thinking in terms of clustering support with this design.

5. One way of addressing my concern with #2 (which is along the lines of the cricket architecture) is to use a B-tree to manage the offsets of the pages themselves. Each node in the B-tree would represent the starting point of a contiguous block of pages - so if you have 50 pages on disk that are ordered by their logical ordering, it would require a single B-tree node. This allows pages to be moved to arbitrary locations (and, in fact, sets you up for infinite stores - you just have occasional checkpoints that include an entire B-tree).

That, of course, re-introduces a lookup...

6. Locality takes a bit of a hit in terms of what is easy to implement vs. not - if an application wants to request that the recman place a set of objects near each other, it can only do so (efficiently) in the context of the initial inserts of the objects (and only then if the allocator assigns recids that are near each other).

BBT: Yes, this is an interesting issue. It seems that the more information available to drive a clustering strategy the better. When an object is first created, the most information that you might expect to have on hand is some notion of a "container".
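Point 5's B-tree of contiguous page runs can be illustrated with a sorted map keyed by the first logical page of each run (a TreeMap stands in for the B-tree here; all names are illustrative). Fifty in-order pages cost a single entry, and moving pages simply splits runs.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch of point 5: logical-to-physical page translation stored as
 *  contiguous runs. Each entry maps the first logical page of a run to the
 *  physical page where that run starts. */
class PageRunIndex {
    // logical start page -> physical start page of a contiguous run
    final TreeMap<Long, Long> runs = new TreeMap<>();

    void addRun(long logicalStart, long physicalStart) {
        runs.put(logicalStart, physicalStart);
    }

    /** Translate a logical page number to its physical page number by
     *  finding the run containing it and adding the offset within the run. */
    long physical(long logicalPage) {
        Map.Entry<Long, Long> run = runs.floorEntry(logicalPage);
        return run.getValue() + (logicalPage - run.getKey());
    }
}
```

This is the lookup Kevin notes is re-introduced: every page access pays a (cached, in-memory) run search.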
If you are able to defer until the object has been linked, then you can leverage the links and metadata about link classes to make clustering decisions. I see several ways of handling this: (1) use the translation page mode of the design and defer the placement of the physical records onto pages until more information is available. This would support clustering of physical records, but not of the OIDs. The existing jdbm defers serialization and writing onto the page image in an appropriate manner but does not use a clustering strategy to guide the choice of pages onto which to place the physical records. (2) use a policy in the application framework that supports persistence by reachability, such that the OID assignment may be deferred. In this case, the OIDs could be clustered as well. (3) use the notion of a "container" for clustering OIDs. Objects inserted into a container would be clustered together.

BBT: This highlights a possible problem with the use of translation pages. If the OIDs are not clustered, then a large store may demonstrate thrashing in the translation pages, since the OIDs themselves do not demonstrate locality of reference. Clustering of persistent objects at insert appears to be required to minimize disk reads, since otherwise reads of translation pages may be scattered. Here are some sample numbers.

#of objects in store: 10M
average size of serialized object: 80 bytes
per object overhead: 8 bytes (record header) vs 11 bytes (slot size)
adjusted average object size: 88 vs 91 bytes
#of translation pages (jdbm 1.x): 12,237
average #of objects per data page: 93 vs 90
#of data pages: 107,422 vs 111,084
total #of pages: 119,659 vs 111,084

Now I am not convinced that the translation page design is bad. It does introduce an indirection, but if there is reasonable locality in the OIDs then the translation pages will be cached.
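Bryan's data-page figures are reproduced exactly by assuming an 8192-byte page: 8192/88 = 93 and 8192/91 = 90 objects per page, and ceil(10M x 88 / 8192) = 107,422 vs ceil(10M x 91 / 8192) = 111,084 data pages, with 12,237 + 107,422 = 119,659 total. The page size is my assumption, but it is the only one consistent with his numbers. A quick sketch of the arithmetic:

```java
/** Arithmetic behind the sample numbers, assuming an 8 KB page size
 *  (an assumption, but it reproduces the quoted data-page counts exactly). */
class StoreSizing {
    static final long OBJECTS = 10_000_000L;
    static final int PAGE = 8192;

    /** Objects that fit on one data page for a given adjusted object size. */
    static int objectsPerPage(int adjustedObjectSize) {
        return PAGE / adjustedObjectSize;
    }

    /** Data pages needed for the whole store (ceiling division). */
    static long dataPages(int adjustedObjectSize) {
        long bytes = OBJECTS * adjustedObjectSize;
        return (bytes + PAGE - 1) / PAGE;
    }
}
```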
If we do not have locality in the OIDs, then the new design will still have to do a lot of page reads for the physical records, since they will not be effectively clustered.

The translation table design appears to be marginally more space efficient, but this is entirely a function of the difference in the per-object overhead (3 bytes), so little differences get magnified by the #of objects.

Kevin wrote:

Of all of the above, the only one that I'm actually really concerned about is #2 - the rest can be addressed with fancy bookkeeping (a pain in the neck to develop, but it should be doable). I suspect that #5 is not going to be palatable, because it kind of defeats the purpose of doing this in the first place.

BBT: I admit that I am less concerned with compaction of the store and more concerned with read performance and with the overhead associated with managing free space in the store. This is still the largest hotspot in jdbm 1.x and takes 10% of the runtime for our application.

I think that the new design will provide faster management of free space on pages. The basic principle is to divide each page into a slot map (allocated like a stack - top down) and a data space (growing bottom up). The data space would be dense. The free space would exist between the end of the data space and the bottom of the slot map. Operations on the page would be atomic (thread safe). Write operations could reorder the data space and the slot map. Write operations could also force a physical record to become indirected (translated to another page) or to become direct (on page).

-bryan

- K

>
All,

I have been re-thinking record management some. I want to share some of my thoughts here. I also want to see whether the version management requirements for Kevin's MVCC module could be satisfied by such a design.
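The page organization Bryan describes - slot map growing down from the top, dense data space growing up from the bottom, free space as the single gap between them - can be sketched as follows (page and per-slot sizes are assumed for illustration):

```java
/** Sketch of the proposed page layout: the slot map grows down like a stack,
 *  the data space grows up and is kept dense, so the free space is always
 *  one contiguous gap between the two regions. */
class PageLayout {
    static final int PAGE_SIZE = 8192;   // assumed page size
    static final int SLOT_SIZE = 4;      // assumed bytes per slot-map entry
    int slotCount = 0;                   // entries in the slot map (top of page)
    int dataBytes = 0;                   // dense data region (bottom of page)

    /** Free space is whatever lies between the data space and the slot map. */
    int freeSpace() {
        return PAGE_SIZE - dataBytes - slotCount * SLOT_SIZE;
    }

    /** Append a record on-page; returns its offset, or -1 if it won't fit
     *  (in which case the real design would indirect it to another page). */
    int append(int recordSize) {
        if (recordSize + SLOT_SIZE > freeSpace()) return -1; // would overlap the slot map
        int offset = dataBytes;          // dense: next record starts where data ends
        dataBytes += recordSize;
        slotCount++;
        return offset;
    }
}
```

Because the data space is dense, freeing a record implies compacting the region and updating the affected slots, which is the slot/data reordering Bryan says write operations may perform.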
The main insight is that the existing translation page design could be rethought as one possible extreme in a more flexible, and hopefully more efficient, design. The basic change is that each page would incorporate the capability to either store logical-to-physical translation slots or directly encode the physical record. The same slot map would be reused for both purposes, with appropriate coding to indicate an on-page record vs. a record that was indirected to another page.

Setting aside the page header for the minute, the design would have records growing up from the bottom of the page (lower addresses) and the slot map growing down from the top of the page. The lower bits of the OID would address slots in the page. Those slots would either give an on-page offset or the pageId and slot of the record on another page. The data space and the slot map space on the page would be kept dense, so there would never be free space between records. This should be a consistent data structure, and reconciliation of page images should be possible.

This design allows us to directly look up records by the OID without indirection. In the best case this means that we do one fetch for an object vs. two. If the record is on the page, then we are done. Otherwise we indirect, and the effect is exactly as with the existing translation pages. In the extreme, if all records are indirected, then the design is essentially the same as the existing translation page design.

I am still playing around with record management rules for the page. There are clearly many possible designs, and clustering remains an important issue.

Got to run.

-bryan
|