From: Thompson, B. B. <BRY...@sa...> - 2006-04-11 18:46:24
The hotspot that I described is physical row allocation. Insert and fetch both drive that, since they force objects to be evicted from the cache, but they are not otherwise a hotspot. I think that the design I am proposing could address this hotspot by creating free space between the data and slot maps. I am sure that there are other ways to solve the problem as well. However, we have been primarily benchmarking an insert-intensive scenario which computes the closure of a knowledge base as it loads statements into the KB.

Read optimization is definitely going to be an issue for the application, and the ability to read objects directly from their OID could be a huge win, as could clustering of OIDs and/or objects. Basically, we need set-oriented read operations for joins and the like - and that requires clustered data to be successful. If you are doing a broad scan of statements in the KB, e.g., all "Persons", then you might actually see the full 2:1 performance benefit for a large KB since, in that situation, you would be reading a large #of statements without revisiting any and you could burn right through the cache. I think that this is really where this becomes an issue - when a read operation is accessing large swaths of persistent data in access patterns where the cache gives you limited benefit, e.g., when you need to read 10x or 100x more pages from disk than will fit into your cache. We could keep a separate cache for translation pages, but you see my point.

Basically, I would like to explore various solutions here and see what works. As you indicated, I would like to see all of the components isolated enough so that we can plug and play different strategies and measure them against some applications of interest.

-bryan

_____

From: jdb...@li... [mailto:jdb...@li...]
On Behalf Of Kevin Day
Sent: Tuesday, April 11, 2006 2:06 PM
To: JDBM Developer listserv
Subject: re[2]: [Jdbm-developer] record management

Bryan-

The numbers you provide are definitely instructive... If you attach a profiler to your application, what percentage of the processing time is being spent in lookup? I suspect that inserts are where the majority of the performance hit is coming from, so I don't want to get too far into talking about ways of tweaking the last bit of performance out of reads, etc...

My general opinion here is that if we want OIDs to have locality, we do it by providing batch insert operations on the record manager (i.e., the application could call insert(Object[]) and get back a list of contiguous OIDs - the translation page allocator would then do whatever it had to in order to get contiguous blocks). I'm not sure that a complete re-think of the dedicated translation page approach is necessary to achieve this goal. Once you have OIDs all on a page, you can have the record allocator move records around as the application sees fit to obtain locality. Reading all of the objects with 'locality' to each other would then involve two page reads. I think that doing a lot of work to reduce 2 page reads to 1 page read may not really gain us much in terms of overall system improvement...

The good news is that the proposed MVCC architecture should be completely independent of the paging system, so we will be free to play with different choices if we see fit. I do want to explore some of the potential hotspots in the proof of concept and see if you have any thoughts on object to page cache interaction - I'll write that in a separate email.

- K

> Kevin, Comments below. -bryan

________________________________________

From: jdb...@li... <mailto:jdb...@li...> [mailto:jdb...@li...]
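Kevin's batch-insert suggestion could look roughly like this. This is a hedged sketch: `BatchAllocator`, its in-memory OID counter, and the method name `insertBatch` are hypothetical simplifications for illustration, not jdbm's actual RecordManager API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a batch insert that hands back contiguous OIDs,
// so that records inserted together share translation-page locality.
class BatchAllocator {
    private long nextOid = 1;  // next free logical id (simplified; real
                               // allocation would consult translation pages)

    List<Long> insertBatch(Object[] records) {
        List<Long> oids = new ArrayList<>(records.length);
        long first = nextOid;
        nextOid += records.length;      // one contiguous reservation
        for (int i = 0; i < records.length; i++) {
            oids.add(first + i);        // serialization + store would go here
        }
        return oids;
    }
}
```

The point of the sketch is only the contiguous reservation: one call, one run of OIDs, which is what lets the allocator place all of the translation slots on a single page.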
<mailto:jdb...@li...> On Behalf Of Kevin Day
Sent: Monday, April 10, 2006 9:36 PM
To: JDBM Developer listserv
Subject: re: [Jdbm-developer] record management

Bryan-

Interesting - this is similar to something that I saw in the (cricket?) architecture for a threaded log based db. A couple of questions/comments that may spark some discussion points or new directions:

1. It seems that the proposed idea would work well until records are updated in a manner that increases their size. At that point, it seems unlikely to me that all of the records on a given page would remain on that page, so you'd have some entries that would overflow into other pages.

BBT: Yes, the notional design is that each page has a slot map and a data space. The slot map records the offset of the physical record on the page. If there is an overflow situation, then the slot map operates as a translation slot and records the _slot_ of the record on a _different_ page. At most one indirection would be performed. If all slots on a page are indirected, then this is the same as the current translation page solution.

2. I'm pretty sure that this would preclude the ability of ever packing and shrinking the database... If I allocate tons of objects, then delete all but one on each page, I can't shrink the database or otherwise optimize it in any way. I have been envisioning a system that always places translation pages at the front of the db file, even if it means moving data pages out of the way during an allocation. In that way, the db can always be shrunk down to the size of the number of translation pages that contain active slots.

BBT: This design could degenerate into the translation page design (per above), but it is oriented towards reducing lookup cost for objects rather than compaction of the store.

3. This design does allow for some form of space optimization without incurring the cost of updating the translation table itself - which definitely has benefits.

4. It seems that this would reduce the efficiency of the translation page cache considerably - instead of having ~2000 slots on a page, we'd probably wind up with ~20-50 (obviously, it depends on the size of the data record). In the multi-version scenario, if a version table has to be written to disk at all (and in the vast majority of cases it will not), this *might* force a larger number of page writes to the safe than is strictly necessary. Given the non-locality of recids, this may or may not actually be a consideration, though.

BBT: I do not see how it could reduce the efficiency of the translation page caching. I think that the better analogy is that translation pages are an overhead that we can avoid in many scenarios, but that the design can degrade to the same performance as the translation page design. Also, I am very much thinking in terms of clustering support with this design.

5. One way of addressing my concern with #2 (which is along the lines of the cricket architecture) is to use a b-tree to manage the offsets of the pages themselves. Each node in the BTree would represent the starting point of a contiguous block of pages - so if you have 50 pages on disk that are ordered by their logical ordering, it would require a single BTree node. This allows pages to be moved to arbitrary locations (and, in fact, sets you up for infinite stores - you just have occasional checkpoints that include an entire BTree). That, of course, re-introduces a lookup...

6. Locality takes a bit of a hit in terms of what is easy to implement vs. not - if an application wants to request that the recman place a set of objects near each other, it can only do so (efficiently) in the context of the initial inserts of the objects (and only then if the allocator assigns recids that are near each other).

BBT: Yes, this is an interesting issue. It seems that the more information available to drive a clustering strategy the better.
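Point 5 above can be sketched in a few lines using an in-memory TreeMap as a stand-in for the proposed on-disk BTree. Each entry describes one contiguous run of pages; the class and method names are illustrative, not from jdbm.

```java
import java.util.Map;
import java.util.TreeMap;

// One entry per contiguous run of pages:
//   key   = first logical page number of the run
//   value = physical page number where that run starts on disk
class PageExtentMap {
    private final TreeMap<Long, Long> runs = new TreeMap<>();

    void addRun(long firstLogicalPage, long firstPhysicalPage) {
        runs.put(firstLogicalPage, firstPhysicalPage);
    }

    // Translate a logical page to its physical location: find the run
    // that starts at or before it, then add the distance into the run.
    long physicalOf(long logicalPage) {
        Map.Entry<Long, Long> run = runs.floorEntry(logicalPage);
        return run.getValue() + (logicalPage - run.getKey());
    }
}
```

With 50 logically ordered pages in one run, the whole run costs a single entry, which is the compactness Kevin describes; the price is the extra lookup he notes at the end of the point.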
When an object is first created, the most information that you might expect to have on hand is some notion of a "container". If you are able to defer until the object has been linked, then you can leverage the links and metadata about link classes to make clustering decisions. I see several ways of handling this: (1) use the translation page mode of the design and defer the placement of the physical records onto pages until more information is available. This would support clustering of physical records, but not of the OIDs. The existing jdbm defers serialization and writing onto the page image in an appropriate manner but does not use a clustering strategy to guide the choice of pages onto which to place the physical records. (2) use a policy in the application framework that supports persistence by reachability such that the OID assignment may be deferred. In this case, the OIDs could be clustered as well. (3) use the notion of a "container" for clustering OIDs. Objects inserted into a container will be clustered together.

BBT: This highlights a possible problem with the use of translation pages. If the OIDs are not clustered, then a large store may demonstrate thrashing in the translation pages since the OIDs themselves do not demonstrate locality of reference. Clustering of persistent objects at insert appears to be required to minimize disk reads, since otherwise reads of translation pages may be scattered. Here are some sample numbers:

#of objects in store: 10M
average size of serialized object: 80 bytes
per object overhead: 8 bytes (record header) vs 11 bytes (slot size)
adjusted average object size: 88 vs 91 bytes
#of translation pages (jdbm 1.x): 12,237
average #of objects per data page: 93 vs 90
#of data pages: 107,422 vs 111,084
total #of pages: 119,659 vs 111,084

Now I am not convinced that the translation page design is bad. It does introduce an indirection, but if there is reasonable locality in the OIDs then the translation pages will be cached.
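The data-page counts in the sample numbers can be reproduced with simple ceiling arithmetic, assuming an 8192-byte page and dense packing of records across pages. The 8192-byte figure is an assumption about jdbm's block size; the translation-page count of 12,237 depends on per-slot details not spelled out in the thread, so it is not recomputed here.

```java
// Back-of-envelope reproduction of the page counts quoted above.
class PageCountEstimate {
    static final long PAGE_SIZE = 8192;  // assumed jdbm block size

    // Pages needed for n objects at the adjusted (payload + overhead)
    // size, assuming records pack densely across the data pages.
    static long dataPages(long nObjects, long adjustedSize) {
        long totalBytes = nObjects * adjustedSize;
        return (totalBytes + PAGE_SIZE - 1) / PAGE_SIZE;  // ceiling division
    }
}
```

For 10M objects this gives 107,422 data pages at 88 bytes and 111,084 at 91 bytes, matching the figures quoted; the jdbm 1.x total of 119,659 is then just 107,422 data pages plus the 12,237 translation pages.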
If we do not have locality in the OIDs, then the new design will still have to do a lot of page reads for the physical records since they will not be effectively clustered. The translation table design appears to be marginally more space efficient, but this is entirely a function of the difference in the per-object overhead (3 bytes), so little differences get magnified by the #of objects.

Kevin wrote: Of all of the above, the only one that I'm actually really concerned about is #2 - the rest can be addressed with fancy bookkeeping (a pain in the neck to develop, but it should be doable). I suspect that #5 is not going to be palatable, because it kind of defeats the purpose of doing this in the first place.

BBT: I admit that I am less concerned with compaction of the store and more concerned with read performance and with the overhead associated with managing free space in the store. This is still the largest hotspot in jdbm 1.x and takes 10% of the runtime for our application. I think that the new design will provide faster management of free space on pages. The basic principle is to divide each page into a slot map (allocated like a stack - top down) and a data space (growing bottom up). The data space would be dense. The free space would exist between the end of the data space and the bottom of the slot map. Operations on the page would be atomic (thread safe). Write operations could reorder the data space and the slot map. Write operations could also force a physical record to become indirected (translated to another page) or to become direct (on page).

-bryan

- K

> All, I have been re-thinking record management some. I want to share some of my thoughts here. I also want to see whether the version management requirements for Kevin's MVCC module could be satisfied by such a design. The main insight is that the existing translation page design could be rethought as one possible extreme in a more flexible, and hopefully more efficient, design.
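Bryan's free-space principle (dense data space growing up, slot map growing down, free space in between) can be sketched as follows. The 11-byte slot size is taken from the sample numbers earlier in the thread; everything else about the layout, including the class name and the absence of a page header, is a guess for illustration.

```java
// Minimal sketch of the proposed page layout: data grows up from the
// bottom, the slot map grows down from the top, and the free space is
// whatever remains between them.
class SlottedPage {
    static final int PAGE_SIZE = 8192;  // assumed block size
    static final int SLOT_SIZE = 11;    // per-slot overhead quoted above

    private int dataEnd = 0;    // first free byte above the dense data space
    private int slotCount = 0;  // slots allocated from the top down

    int freeSpace() {
        return PAGE_SIZE - dataEnd - slotCount * SLOT_SIZE;
    }

    // Allocate a record on this page; returns the new slot index, or -1
    // if the record plus its slot will not fit in the free space.
    int allocate(int recordLength) {
        if (recordLength + SLOT_SIZE > freeSpace()) return -1;
        dataEnd += recordLength;   // data space stays dense
        return slotCount++;        // slot map stays dense too
    }
}
```

Because both regions stay dense, the free-space check is a single subtraction rather than a free-list walk, which is the faster free-space management Bryan is after.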
The basic change is that each page would incorporate the capability to either store logical-to-physical translation slots or directly encode the physical record. The same slot map would be reused for both purposes, with appropriate coding to indicate an on-page record vs. a record that was indirected to another page. Setting aside the page header for the moment, the design would have records growing up from the bottom of the page (lower addresses) and the slot map growing down from the top of the page. The lower bits of the OID would address slots in the page. Those slots would either give an on-page offset or the pageId and slot of the record on another page. The data space and the slot map space on the page would be kept dense, so there would never be free space between records. This should be a consistent data structure, and reconciliation of page images should be possible.

This design allows us to directly look up records by the OID without indirection. In the best case this means that we do one fetch for an object vs. two. If the record is on the page, then we are done. Otherwise we indirect, and the effect is exactly as with the existing translation pages. In the extreme, if all records are indirected, then the design is essentially the same as the existing translation page design. I am still playing around with record management rules for the page; there are clearly many designs that are possible, and clustering remains an important issue. Got to run.

-bryan
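The fetch path described above can be sketched like this. The slot encoding, the 7-bit slot field, and all of the names are hypothetical choices made for the sketch, not part of the actual proposal.

```java
// A slot either holds an on-page offset (direct) or points at a slot on
// another page (indirected); at most one indirection is ever followed.
class Slot {
    boolean direct;   // true: record is on this page at 'offset'
    int offset;       // valid when direct
    long otherPage;   // valid when indirected
    int otherSlot;    // valid when indirected
}

class OidLookup {
    static final int SLOTS_PER_PAGE_BITS = 7;  // illustrative slot-field width

    // The low bits of the OID select the slot; the high bits select the page.
    static long pageOf(long oid) { return oid >>> SLOTS_PER_PAGE_BITS; }
    static int  slotOf(long oid) {
        return (int) (oid & ((1 << SLOTS_PER_PAGE_BITS) - 1));
    }

    // Page reads a fetch costs under this design: 1 in the direct (best)
    // case, 2 when the slot indirects, exactly as with translation pages.
    static int pageReads(Slot s) { return s.direct ? 1 : 2; }
}
```

In the degenerate case where every slot on a page is indirected, every fetch costs 2 reads and the page behaves exactly like a translation page, which is the equivalence claimed at the end of the proposal.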
_______________________________________________
Jdbm-developer mailing list
Jdb...@li...
https://lists.sourceforge.net/lists/listinfo/jdbm-developer