From: Thompson, B. B. <BRY...@sa...> - 2005-09-28 12:27:43
|
Kevin,

A while back I posted on the notion of "logical containers." I don't believe that I ever cleared up the terminology to your satisfaction. However, the recent thread on the translation pages has given me some insight into how to use such containers to consume those 10 remaining bits in the recid under the current translation page scheme. I'll sketch out the approach that I am suggesting. It has the advantage that it packs many logical records (from the application perspective) inside of a single physical record (from the jdbm perspective).

The basic idea is that any given physical record serves as a logical record containing some application data, but can also serve as a container for other logical records. The container behavior provides for up to 2^N objects in the physical record, where N is the #of bits available for addressing -- 10 in the case of the current translation page scheme as you have outlined the current jdbm behavior. A flag marks whether the physical record is currently a container or not. When dynamically promoted to a container, the following data are added to the record:

- #of logical records in the logical container (may span multiple physical containers)
- #of logical records in this physical record (up to 2^N)
- recid of the first secondary container (head of a linked list of zero or more physical records)
- recid of the last secondary container (tail of a linked list of zero or more physical records)
- long[] of keys into the container (maintained in sorted order)
- Object[] of values in the container (indexed into by finding the matching recid in the key[])
- keySerializer
- valueSerializer

As you can see, the physical record is dynamically transformed into something that is capable of looking up the recid in an internal index.
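A minimal sketch of the promoted container record described above, under stated assumptions: N = 10, the class and field names are mine for illustration (not jdbm code), and the "internal index" is a binary search over the sorted key[] holding the low-order bits of each recid.

```java
import java.util.Arrays;

// Hypothetical sketch of a physical record promoted to a logical container.
// N, the field names, and the lookup are illustrative assumptions, not jdbm code.
class LogicalContainer {
    static final int N = 10;                     // low-order recid bits used for addressing
    static final long SLOT_MASK = (1L << N) - 1;

    long logicalRecordCount;    // #of logical records across the whole logical container
    int  localRecordCount;      // #of logical records in this physical record (<= 2^N)
    long firstSecondaryRecid;   // head of the linked list of secondary containers
    long lastSecondaryRecid;    // tail of the linked list of secondary containers
    long[]   keys;              // packed (low-bit) recids, maintained in sorted order
    Object[] values;            // values[i] pairs with keys[i]

    LogicalContainer(long[] keys, Object[] values) {
        this.keys = keys;
        this.values = values;
        this.localRecordCount = keys.length;
        this.logicalRecordCount = keys.length;
    }

    /** High bits of the recid locate the translation-page slot for this physical record. */
    static long translationKey(long logicalRecid) {
        return logicalRecid >>> N;
    }

    /** The low N bits index into the container's internal sorted key[]. */
    Object fetch(long logicalRecid) {
        int i = Arrays.binarySearch(keys, logicalRecid & SLOT_MASK);
        return i >= 0 ? values[i] : null;
    }
}
```

Because the high-order bits of every recid in one physical record are identical, only the low N bits need to be stored in key[], which is what makes the packing cheap.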
The way this works is that the least significant N bits of the logical recid provided by the application are dropped in order to find the block and offset in the translation page that points to the physical record. Those least significant bits are then used to index into the container to find the logical record (so the key[] could be packed, since the high order bits are always the same).

The application inserts into a logical container using: insert( long containerHintId, Object obj ). The container hint is the recid of any object in that container. The serializer is the value serializer for the container (see above). When the address space for the physical record (those N bits) becomes full, a secondary container is created and objects are inserted into that container instead.

Note that the CRUD operators have basically the same cost as the original jdbm CRUD operators, since the only indirection is the translation page, exactly like current jdbm operations. The advantage is that many logical records may be packed into the same physical record, which has the same benefits that you noted with using a btree to pack records together. Multiple logical records in a single physical record means:

- fewer, but larger, I/O transfers
- fewer distinct physical records, so fewer objects in the cache.

I have implemented a variant on this as a RecordManager interface that delegates to the cache and then to the BaseRecordManager. This works very nicely. However, given the discussion of the translation pages, it is clear that this technique could be migrated directly into jdbm, which would give us full utilization of those unused address bits. So, when considering design alternatives for the translation pages, I think that this approach can be used to regain the full logical address space. The other consideration, of course, is changing the translation pages in order to support reordering of the pages within the store.

-bryan

-----Original Message-----
From: jdb...@li...
[mailto:jdb...@li...] On Behalf Of Kevin Day
Sent: Wednesday, September 28, 2005 2:12 AM
To: JDBM Developer listserv
Subject: [Jdbm-developer] More reverse lookup info

All-

Here are my thoughts for the evening - I'm looking forward to your responses and ideas...

Logical row id reverse lookup redux
--------------------------------------------------

Based on feedback from Markus, I went back and looked at the logical row ids. It turns out that the limit on the maximum logical row id has more to do with the 8 byte representation and how it maps to block/offset than it does with the amount of space actually available in the file. By taking advantage of this, and of the fact that all logical rowids are divisible by 10 (because of the slot size in the translation page), I can get the minimum bytes needed to capture a logical row id down to 6 (actually, 44 bits - which leaves a few bits for marking free and previous-free state to support coalescing). I'm pretty sure I've maxed it out at this point, though. Even with taking advantage of splitting records (to reduce the 4 byte size field to a 1 byte 'unused' field), I still can't get a record header smaller than 11 bytes. This is going to force us to either:

1. Maintain a separate table with reverse lookup, or
2. Change the binary format of the file (which will require some sort of conversion procedure)

I'm not keen on either, but I really don't see any alternative - hopefully someone else will?

Conversion / Off-line pack and rebuild
-----------------------------------------------------

I have some thoughts on a conversion procedure - I'm going to put them down here to see if anyone has any comments (note that this may make for an off-line pack and rebuild strategy as well):

1. Pull the length of the first logical row id's record
2. Allocate space from the available page list to store this record (probably using the used space of the record, not the available space), including the new header size (allocate new pages as necessary)
3. Insert that record at the new location
4. Update the translation page for the new record location, and insert the logical row id into the new record header
5. Mark the old record as free
6. Coalesce adjacent free records on either side of the old record, if possible
7. If the page containing the old record (or the result of coalescing the old record) now contains only free record data, move it to the free list
8. Update the value of the last logical row id to be processed (important for restarting without data corruption)
9. Every XX loops, call commit()
10. Continue with the next logical row id and repeat
11. After all logical row ids have been processed, reclaim any free pages that are now located at the end of the file

A couple of comments on this strategy:

First, we need to have some mechanism for pulling free pages that are as close to the front of the file as possible. We could order the list, or add parent, left and right pointers to the free page and implement a binary heap, but maybe there is some other mechanism that would be more efficient?

Second, we would probably want to also move other page types (free logical and physical row id pages come to mind - but I think we can get rid of those entirely, so maybe data pages are the only ones we need to worry about after all).

Third, the coalescing step is critical to this strategy working, as it will free up pages. Otherwise, we could wind up with a bunch of unused free pages at the end of the process.

Fourth, this strategy may work nicely as a general pack & rebuild operation (even if we aren't changing record header sizes).

Fifth, this may even work as an on-line pack & rebuild operation (we still need the reverse lookup to allow us to pack translation pages to the front of the file, though, so this doesn't remove that need - see my comments in the next section "Translation Page Strategy" for what I'm thinking here).
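The conversion loop (steps 1 through 11 above) can be sketched as a toy, in-memory simulation - with the caveat that the maps here merely stand in for the store file and the translation pages, coalescing and commit() are omitted, and none of the names are jdbm code:

```java
import java.util.*;

// Toy simulation of the off-line pack & rebuild loop (steps 1-5 and 11).
// TreeMap stands in for the store file, HashMap for the translation pages;
// these are illustrative assumptions, not jdbm structures.
class PackSimulation {
    final TreeMap<Long, byte[]> physical = new TreeMap<>(); // file offset -> record bytes
    final Map<Long, Long> translation = new HashMap<>();    // logical row id -> file offset
    long fileSize = 0;

    void insert(long logicalId, long atOffset, byte[] data) {
        physical.put(atOffset, data);
        translation.put(logicalId, atOffset);
        fileSize = Math.max(fileSize, atOffset + data.length);
    }

    byte[] fetch(long logicalId) {
        return physical.get(translation.get(logicalId));
    }

    /** Copy every record to the front of the "file", update its
     *  translation entry, then reclaim the trailing free space. */
    void pack() {
        TreeMap<Long, byte[]> packed = new TreeMap<>();
        long nextFree = 0;
        for (Map.Entry<Long, Long> e : translation.entrySet()) {
            byte[] data = physical.get(e.getValue()); // steps 1-2: read the old record
            packed.put(nextFree, data);               // step 3: insert at the new location
            e.setValue(nextFree);                     // step 4: update the translation entry
            nextFree += data.length;
        }
        physical.clear();                             // step 5: old locations are freed
        physical.putAll(packed);
        fileSize = nextFree;                          // step 11: reclaim trailing free space
    }
}
```

The real procedure would interleave commit() calls and coalesce free records in place rather than rebuilding wholesale, but the invariant is the same: after each relocation, the translation entry and the record's new location agree, so the process can be restarted safely.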
Basically, we could have a low priority thread that does one of the above loops, then goes to sleep. As long as the operation is synchronized properly, it should be possible to do this while the database is actually being used. The other approach would be to move through the loop with some duty cycle (i.e. every X inserts, run the loop), but doing it in a separate thread allows the pack to occur when the database isn't actively being used, so it won't affect performance.

Sixth, we would need to make sure that we don't move records around unnecessarily. I think this is just a matter of only moving records if they belong to blocks that have unused space in them (allowing for the special condition where there can be 16 bytes + 1 record header at the end of the page).

Is there anything else I might be missing?

Translation Page Strategy
------------------------------------

By the way (also based on Markus' comments), I went back and spent some time thinking about the new translation page strategy. I think that using a tree structure like I was proposing is *not* the way to go. Instead, I think we should adopt a policy of always placing new translation pages near the front of the file, and that we never move a translation page once it is allocated. Maintaining a single page read for translation is going to be important, and I think we can gain the same desired benefit (i.e. creating a system where the translation pages don't prevent us from packing free space out of the file) and still maintain O(1) lookups (with less disk space overhead as well). Any older files would continue to work as normal, but could never become smaller than the largest translation page allocated before the new code was put in place.

In order for this to work, we will need to know the physical location of the largest translation page (excluding pages added before this strategy change). When a new translation page is requested, we go to the physical page immediately following, and move its contents into a free page somewhere later in the file. We then mark the recently freed page as a translation page, link it to the correct page list, and start using it. There are some details about how to properly handle freed translation pages, but this strategy seems like it would work.

Basically, we recognize that we can't practically move translation pages around, so we make sure we place them initially where we want them to be. For this to work, though, we have to be able to move data pages around on-demand - and the only way that's going to happen is if we have a reverse lookup to find the logical row id for a given record. Stumped on how best to do this...

- K

-------------------------------------------------------
This SF.Net email is sponsored by: Power Architecture Resource Center: Free content, downloads, discussions, and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
Jdbm-developer mailing list
Jdb...@li...
https://lists.sourceforge.net/lists/listinfo/jdbm-developer |
From: Kevin D. <ke...@tr...> - 2005-09-29 05:20:55
|
Bryan-

Thanks for the description.

I'm still coming off a wicked cold, so my brain is a bit fuzzy - bear with me. As with most things, concrete examples help, so I'm going to try to put together a quick trace to help me understand:

logical row id = [12345][67] <-- I'm using the [][] notation to reflect values that are recovered by bit mask and shift operations - I don't want to have to think too hard here, so let's just say we have an arbitrary logical row id that allows us to yank 2 values from it.

You use [12345] to locate the first page of the container. This page basically contains an array of longs that act as page/offset values into the logical container. Are you then looking for the value 67 in the keys using a binary search, then taking that index and mapping to the object array?

OK - given that my understanding of the situation is as described above, here are my comments:

1. This approach requires that lots of objects be deserialized at the same time. This may or may not be a good thing, depending on the application's needs (more on that in a second - I think there may be some things about how the jdbm layers interact that haven't been made as clear as they should be).

2. If you happen to have an application that needs a data structure that meets the following conditions, then this is a good data structure:

a) Keys are longs (or some other primitive number - strings or more complicated objects that require custom comparators are out of the question), and you don't need to be able to specify the values of the keys explicitly at the application level (i.e. you couldn't key off of some value that was actually a long - such as a datestamp or anything like that).

b) There is a strong implicit, but not necessarily well defined, relationship between groups of stored objects.

c) You don't have to store very many "related" objects, or the objects are really, really small. In the current implementation, if a large number of objects are being stored, then looking up a single value will be very expensive in terms of disk reads. In the example you give (with the object array sized at 2^10), and assuming a (very) modest serialized data size per object of 32 bytes, you would wind up spanning 5 or 6 pages. That's a lot of disk reading, and a lot of deserialization, just to retrieve one object.

3. Now that I'm thinking about it, I'm pretty sure that your solution here is a degenerate BTree - basically, this logical container is a set of BPage leaves, and your addressing scheme is used to allow direct access to the page of interest without walking the tree. If you look at the BPage code, and remove all data members that are non-leaf related, I think you will find that the data structures are identical. Is there anything else here that I might be missing?

So, to summarize, the real advantage of this data structure is that it allows you to say "Store Object Y, and by the way, most of the time I'm going to use it, I'm going to also be using Object X that I've already stored". This is certainly very, very useful in some situations (although none come to mind at the moment - but that's more a reflection of the type of work I do...), and removes the requirement of figuring out how to allocate keys so they clump related objects together.

One thing I am unclear on, though, is how you handle the situation where I insert an object and say it is related to another object that is on a full page. Do you just not worry about it at that point? In a BTree, the tree would actually be adjusted to provide proximity... There's certainly no reason that you couldn't have a similar mechanism, but I think it would be very inefficient compared to the BTree approach of splitting pages and balancing the tree.

OK, enough with the comments, now for a quick discussion about jdbm IO...

In your description of the data structure, you said that one advantage of this is fewer (but larger) IO transfers. This is actually not the case.

One thing that hasn't been talked about a lot in all of our discussions about how jdbm works is the way IO occurs. I'm extrapolating a bit from your comment, so if I get it wrong, please gently correct me... When you read a record from the record manager, jdbm does not go to that offset on the disk and read the bytes for that record. jdbm *always* performs its disk IO one page at a time (2^13 bytes). If you are reading a record that starts on byte 2000, then it reads the entire page, then does the offset into the page and yanks the bytes out for deserialization.

This is the reason that I've been puzzled about your request for adding pre-fetch to the system. For all intents and purposes, there *is* a pre-fetch mechanism, or as close as you can come to one in a general purpose, random access IO system. It would be possible to do a little bit better here in terms of allowing the developer to specify a set of logical row ids to fetch, then do a gathering read (there's that NIO again!) to pull any pages that aren't in cache already.

This has a couple of implications that may have a big impact on the perceived benefits of the logical container data structure over storing individual records (this is not to say that there aren't advantages for certain data sets, but I think that the following will remove any concern you have about efficient disk access):

If you have a lot of small objects stored on a page (just to be consistent with my meanderings above, let's say that we have 32 byte objects, which will give you about 256 objects per page), then if you request the 90th object on the page, the entire page is read into memory, the bytes for just the 90th object are deserialized, and the object is returned. This is where the object cache comes into play. Any future request for that object will return it from the cache, so you never have to deserialize that object unless it is actually evicted from the cache (both L1 and L2).

Another point of interest is that there is a page cache. This means that if you then turn around and read the 95th object, all that is involved is a deserialization operation - no disk IO is required, because the page is already in the page cache. This, of course, makes the assumption that you actually *want* to read object 90 and 95 in close proximity to each other.

(As a side note - the size of the page cache is actually a high-watermark, which is one of the reasons that it was important to commit() even if you had transactions turned off - even after you commit() the transaction, the total number of pages used in that transaction will continue to be in memory.)

Some other comments: I'm actually not at all concerned about the "wasted" space in the current logical record id. The amount of addressable space in the file shrinks as you add more logical row ids (i.e. as the translation table grows). This means that you'd have to have a file filled with 8 byte objects in order to run out of logical row ids before you run out of addressable file space (to say nothing of the fact that you'd have run out of physical drive space eons before that - as a fun exercise, calculate how long it would take to write 2^63 bytes at modern disk bus speeds... it's 1 billion years... the sun would not quite have actually consumed the earth, but I think you get the picture :-) )

I was actually quite happy that the wasted space was there, because it meant I could be tricky and shrink the reverse lookup value in the record header so we don't clobber disk space in order to have reverse lookup.

Enough for now - lots of stuff to think about.

- K

> Kevin,
>
> A while back I posted on the notion of "logical containers." I don't believe that I ever cleared up the
> terminology to your satisfaction. However, the recent thread on the translation pages has given me
> some insight as to how to use such containers so as to consume those 10 remaining bits in the
> recid in the current translation page scheme. I'll sketch out the approach that I am suggesting. It
> has the advantage that it packs many logical records (from the application perspective) inside of a
> single physical record (from the jdbm perspective).
> <snip> |
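Kevin's point about page-granular IO comes down to a little arithmetic. The 2^13-byte page size is from his message; the helper names below are mine, not jdbm APIs:

```java
// Illustrates page-granular IO arithmetic as described above: every read
// fetches a whole page, and the record is sliced out by its in-page offset.
// PAGE_SIZE (2^13) is from Kevin's message; the helper names are hypothetical.
class PageMath {
    static final int PAGE_SIZE = 1 << 13;   // 8192 bytes per page

    /** Which page a given file offset falls on. */
    static long pageOf(long fileOffset) {
        return fileOffset / PAGE_SIZE;
    }

    /** Offset of the record within its page. */
    static int offsetInPage(long fileOffset) {
        return (int) (fileOffset % PAGE_SIZE);
    }
}
```

So a record starting at byte 2000 lives on page 0 at in-page offset 2000, and the whole 8192-byte page is read to get at it. With 32-byte objects, 8192 / 32 = 256 objects share one page read; a 2^10-entry container of such objects is 32768 bytes, i.e. 4 pages of data before record headers and keys are counted, consistent with the 5 or 6 pages Kevin estimates.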