From: 'Kevin D. ' <ke...@tr...> - 2005-10-03 17:02:57
Resend: The SF listserv did something goofy to my post... - K

Bryan-

>> for example I have objects that have a one-to-one relationship with a btree.

I think this might be a micro-optimization - the BTree object is a trivial read - all the data is actually stored in the BPages, and you definitely don't want to get into trying to put BPages into your containers (that would defeat the whole point of a BTree). I think that the practical uses of this type of container are not orthogonal - for example, I could make the reverse argument that I might want to store your containers inside a BTree (via a string key, for example).

I really think that it is appropriate to implement these containers at a level where they are independent of the record manager - this will give you considerably more freedom in your use of the containers (for example, you can optimize the slot count of each container page based on the amount of data it will contain), and will make sure that if we do need to make any changes to the logical row id structure, we haven't painted ourselves into a corner. I also think that this will make the container implementation better encapsulated, because you don't have to deal with promoting objects in the container itself. You could still make promoting of objects transparent to your application with a thin interface layer (what you are doing with the record manager), but by keeping the container itself as a higher level object, it becomes much more accessible to the general jdbm community.

For example, in my current application, I have objects that serialize into large byte arrays. If I were going to decide that I wanted to do the extra work to provide hints that certain of these objects should be loaded together, I would absolutely not want to use the full 10 bits. I might want to create a container that only holds 5 objects per page (so as not to wind up with a container that spans 30 or 40 disk pages).
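[Editorial sketch: a container built above the record manager could pick its slot count from the expected serialized object size. The names below are illustrative only - nothing here is actual jdbm code.]

```java
// Illustrative only - not jdbm code. A container implemented above the
// record manager can size its slot count from the expected serialized
// object size, aiming to keep one container record within a single 8K page.
class TunedContainer {
    static final int PAGE_SIZE = 1 << 13;  // 8192 bytes, jdbm's page size

    final int slotsPerPage;

    TunedContainer(int expectedObjectBytes) {
        // e.g. ~1600-byte objects give 5 slots; ~32-byte objects give 256
        this.slotsPerPage = Math.max(1, PAGE_SIZE / expectedObjectBytes);
    }
}
```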
I might want to create another container that could hold 500 objects per page for another type of object in my data store that serializes to a small byte array. I'm sure it would be possible to add more settings, etc... to an implementation that was closely tied to the record manager, but I think that mixing the layers like this results in a less than optimal design for a *general* container (I'm certain that doing it in the record manager is optimal for your particular application - but I think the cost of doing it outside the record manager is quite low, and that it would be better to have a general purpose container available to all users).

What do you think about calling this structure an "Associative Container" or a "Clumping Container" or something that is a bit clearer as to the purpose of the container? I'm looking forward to hearing your comments! - K

> Kevin, I've been thinking about this also. The role of the "container hint" - place X, Y, Z into the same container as A - is to suggest to the persistence layer that {A,X}, {A,Y}, {A,Z} are often co-materialized and to take whatever steps it can to facilitate that. One way to facilitate that is using a compound record - much like a node of a btree. Doing this efficiently requires encoding the offset within a container as part of the recid used by the application. Right now I am wrapping the cache/base record manager and shifting all recids N bits left to make room for an N bit address space within a container. Given what I know about the translation pages, it is clear that I have 10 bits for free, which is plenty. However if the logical recids became dense it would introduce a clear tradeoff between the address space within a container and the #of addressable records. I don't see any way around that -- at least using this technique. Unlike the btree or hash support, container hints can occur with any record, even a record that is a btree, bpage, htree, etc.
In this sense they are orthogonal to the traditional container techniques. In fact some of these cases make good sense, for example I have objects that have a one-to-one relationship with a btree. The use of containers means that I could have both my application object and the BTree in the same physical record. Another approach to containers would be to manage which physical records fit on the same page (8K block). If we had a reverse lookup from physical records to translation pages, then we could rearrange physical records, even ones that pre-exist, so that co-materialization hints were respected. That is not a good match with the current design since, to the best of my knowledge, pages are not "cached" in any sense and only the fetched record on a page is actually deserialized. -bryan

-----Original Message-----
From: jdb...@li...
To: 'JDBM Developer listserv'
Sent: 9/29/2005 6:35 PM
Subject: re[6]: [Jdbm-developer] logical containers and translation pages (was More reverse lookup info)

Bryan-

I think that your container belongs at the same level as BTree and HTree - coupling it directly to the page manager would make our lives more difficult down the road when we start implementing pack & rebuild and page level optimizations (NIO, etc...). Is that what you mean by "native jdbm feature"? Just wanted to confirm. At some point in the future if you start looking at alternative designs for keeping the pages balanced, you may want to take a look at how BTree handles balancing - the data structures are so similar that there may be some hints in there that could be useful. Cheers, and thanks for helping me understand how these things work. - Kevin

> Kevin, I've been working my way towards that. With the current translation page scheme I think that I know how to take my existing code (which implements the RecordManager interface and delegates to the Cache/BaseRecordManager) and migrate it to be a native jdbm feature. This is definitely the way to go.
However, I am not sure how to do this if we wind up changing the jdbm translation pages. With respect to secondary containers, the rule that I use is that objects are inserted into a secondary container once the primary container is full (there are no more address bits available). This is not at the "page" level, but at the "record" level. An insert operation inserts into the primary container if there are free addressable slots and otherwise into the tail of the linked list of secondary containers. I then rotate any secondary container which is no longer full (through a delete of some record in that container) to the tail. This management strategy appears to work well.

Your point concerning correlation is correct. The "container" is a hint to the persistence layer, not a guarantee that one object winds up in the same record as another. I have not explored this, but you could clearly move logical records around within containers based on historical evidence of correlated access paths. Such an approach could doubtless be developed to place objects within containers dynamically based on historical paths. -bryan

-----Original Message-----
From: jdb...@li...
To: 'JDBM Developer listserv'
Sent: 9/29/2005 12:46 PM
Subject: re[4]: [Jdbm-developer] logical containers and translation pages (was More reverse lookup info)

Bryan-

Fantastic. I would love to see this added as another container type to jdbm (BTree, HTree and logical container). I have to admit that the term "logical container" is not particularly descriptive... Is there a more formal name for this data structure? Also, as an implementation detail, do you wind up splitting pages when requests are made to store too many "related" objects? Or do you just add another page?
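[Editorial sketch of Bryan's insert/rotate rule described earlier in this message - inserts go to the primary container while it has free addressable slots, otherwise to the tail of the linked list of secondary containers, and a secondary that frees a slot through a delete rotates to the tail. All names below are illustrative, not actual jdbm classes.]

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch (not jdbm code) of the secondary-container rule:
// insert into the primary container while it has free addressable slots,
// otherwise into the tail of the linked list of secondary containers; a
// secondary container that frees a slot through a delete is rotated to
// the tail so that it is the next one refilled.
class ContainerChain {
    static final int CAPACITY = 1 << 10;   // addressable slots per container

    static class Container {
        int used;                          // occupied slots
        boolean isFull() { return used >= CAPACITY; }
    }

    final Container primary = new Container();
    final Deque<Container> secondaries = new ArrayDeque<>();

    /** Insert one record, returning the container that received it. */
    Container insert() {
        Container target;
        if (!primary.isFull()) {
            target = primary;
        } else {
            Container tail = secondaries.peekLast();
            if (tail == null || tail.isFull()) {
                tail = new Container();    // chain grows at the tail
                secondaries.addLast(tail);
            }
            target = tail;
        }
        target.used++;
        return target;
    }

    /** A delete frees a slot; rotate a no-longer-full secondary to the tail. */
    void delete(Container c) {
        c.used--;
        if (c != primary && secondaries.remove(c)) {
            secondaries.addLast(c);
        }
    }
}
```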
I'm asking because it seems that if I have object A in the store, then I add object B and say that its usage is correlated to A but there isn't enough space, and that object gets placed on a new page, then I add object C and say that its usage is correlated to A, C will be placed on the same page as B, but neither will be in proximity to A. I guess the point is that B and C may not be closely correlated at all, even though they are both correlated to A. Just a thought... - K

> Kevin, I believe that your analysis is essentially correct. I will note that using the DumpUtility I have found that the "container" objects tend to fit on one page, e.g., a useful container is still smaller than 8k in practice (for our application). This is basically a scheme for object locality. All objects in the container are deserialized at once (like BPage). The physical records are cached. Once the container has been fetched, a subsequent request for any object in that container results in a cache hit on the container and then a binary search on the keys to find the already deserialized object. In all, this has proven to be extremely effective for our application. We see a 100% performance boost. -bryan

-----Original Message-----
From: jdb...@li...
To: JDBM Developer listserv
Sent: 9/29/2005 1:23 AM
Subject: re[2]: [Jdbm-developer] logical containers and translation pages (was More reverse lookup info)

Bryan-

Thanks for the description. I'm still coming off a wicked cold, so my brain is a bit fuzzy, so bear with me. As with most things, concrete examples help, so I'm going to try to put together a quick trace to help me understand:

logical row id = [12345][67] <-- I'm using the [][] notation to reflect values that are recovered by bit mask and shift operations - I don't want to have to think too hard here, so let's just say we have an arbitrary logical row id that allows us to yank 2 values from it.
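[Editorial sketch to make the [][] notation concrete: the mask-and-shift recovery of the two fields, plus the binary search over container keys that Bryan described. None of these names are actual jdbm API.]

```java
import java.util.Arrays;

// Illustrative sketch, not jdbm API: recover the two fields of a container
// recid by shift and mask, then find the slot's object with a binary search
// over the container's sorted key array.
class ContainerRecid {
    static final int SLOT_BITS = 10;                 // the 10 free bits
    static final long SLOT_MASK = (1L << SLOT_BITS) - 1;

    /** Pack a container recid and an in-container slot into one recid. */
    static long pack(long containerRecid, int slot) {
        return (containerRecid << SLOT_BITS) | (slot & SLOT_MASK);
    }

    /** The [12345] part: the recid of the container's physical record. */
    static long containerRecid(long recid) {
        return recid >>> SLOT_BITS;
    }

    /** The [67] part: the slot index within the container. */
    static int slot(long recid) {
        return (int) (recid & SLOT_MASK);
    }

    /** Binary search the sorted slot keys; values holds the deserialized objects. */
    static Object fetch(long[] keys, Object[] values, long recid) {
        int idx = Arrays.binarySearch(keys, slot(recid));
        return idx >= 0 ? values[idx] : null;        // null if the slot is unused
    }
}
```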
You use [12345] to locate the first page of the container. This page basically contains an array of longs that act as page/offset values into the logical container. Are you then looking for the value 67 in the keys using a binary search, then taking that index and mapping to the object array?

OK - given that my understanding of the situation is as described above, here are my comments:

1. This approach requires that lots of objects be deserialized at the same time. This may or may not be a good thing, depending on the application's needs (more on that in a second - I think there may be some things about how the jdbm layers interact that haven't been made as clear as they should).

2. If you happen to have an application that needs a data structure that meets the following conditions, then this is a good data structure:

a) Keys are longs (or some other primitive number - strings or more complicated objects that require custom comparators are out of the question), and you don't need to be able to specify the values of the keys explicitly at the application level (i.e. you couldn't key off of some value that was actually a long - such as a datestamp or anything like that).

b) There is a strong implicit, but not necessarily well defined, relationship between groups of stored objects.

c) You don't have to store very many "related" objects, or the objects are really, really small. In the current implementation, if a large number of objects are being stored, then looking up a single value will be very expensive in terms of disk reads. In the example you give (with the object array sized at 2^10), and assuming a (very) modest serialized data size per object of 32 bytes, you would wind up spanning 5 or 6 pages. That's a lot of disk reading, and a lot of deserialization, just to retrieve one object.

3.
Now that I'm thinking about it, I'm pretty sure that your solution here is a degenerate BTree - basically, this logical container is a set of BPage leaves, and your addressing scheme is used to allow direct access to the page of interest without walking the tree. If you look at the BPage code, and remove all data members that are non-leaf related, I think you will find that the data structures are identical. Is there anything else here that I might be missing?

So, to summarize, the real advantage of this data structure is that it allows you to say "Store Object Y, and by the way, most of the time I'm going to use it, I'm going to also be using Object X that I've already stored". This is certainly very, very useful in some situations (although none come to mind at the moment - but that's more a reflection of the type of work I do...), and removes the requirement of figuring out how to allocate keys so they clump related objects together. One thing I am unclear on, though, is how you handle the situation where I insert an object and say it is related to another object that is on a full page. Do you just not worry about it at that point? In a BTree, the tree would actually be adjusted to provide proximity... There's certainly no reason that you couldn't have a similar mechanism, but I think it would be very inefficient compared to the BTree approach of splitting pages and balancing the tree.

OK, enough with the comments, now for a quick discussion about jdbm IO... In your description of the data structure, you said that one advantage of this is fewer (but larger) IO transfers. This is actually not the case. One thing that hasn't been talked about a lot in all of our discussions about how jdbm works is the way IO occurs. I'm extrapolating a bit from your comment, so if I get it wrong, please gently correct me... When you read a record from the record manager, jdbm does not go to that offset on the disk and read the bytes for that record.
jdbm *always* performs its disk IO one page at a time (2^13 bytes). If you are reading a record that starts on byte 2000, then it reads the entire page, then does the offset into the page and yanks the bytes out for deserialization. This is the reason that I've been puzzled about your request for adding pre-fetch to the system. For all intents and purposes, there *is* a pre-fetch mechanism, or as close as you can come to one in a general purpose, random access IO system. It would be possible to do a little bit better here in terms of allowing the developer to specify a set of logical row ids to fetch, then do a gathering read (there's that NIO again!) to pull any pages that aren't in cache already.

This has a couple of implications that may have a big impact on the perceived benefits of the logical container data structure over storing individual records (this is not to say that there aren't advantages for certain data sets, but I think that the following will remove any concern you have about efficient disk access): If you have a lot of small objects stored on a page (just to be consistent with my meanderings above, let's say that we have 32 byte objects, which will give you about 256 objects per page), and you request the 90th object on the page, the entire page is read into memory, the bytes for just the 90th object are deserialized, and the object is returned.

This is where the object cache comes into play. Any future request for that object will return it from the cache, so you never have to deserialize that object unless it is actually evicted from the cache (both L1 and L2). Another point of interest is that there is a page cache. This means that if you then turn around and read the 95th object, all that is involved is a deserialization operation - no disk IO is required, because the page is already in the page cache. This, of course, makes the assumption that you actually *want* to read object 90 and 95 in close proximity to each other.
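[Editorial sketch of that page-at-a-time read path - illustrative only, not the actual jdbm implementation, and it ignores records that span a page boundary.]

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch, not the actual jdbm code: every read pulls the whole
// 8K page containing the record, then slices the record's bytes out at its
// offset within that page. (Records spanning a page boundary are ignored
// here for simplicity.)
class PageIO {
    static final int PAGE_SIZE = 1 << 13;  // 2^13 = 8192 bytes

    static byte[] readRecord(RandomAccessFile file, long byteOffset, int length)
            throws IOException {
        long pageNo = byteOffset / PAGE_SIZE;        // page holding the record
        byte[] page = new byte[PAGE_SIZE];
        file.seek(pageNo * PAGE_SIZE);
        file.readFully(page);                        // whole-page read
        int offsetInPage = (int) (byteOffset % PAGE_SIZE);
        byte[] record = new byte[length];
        System.arraycopy(page, offsetInPage, record, 0, length);
        return record;                               // bytes ready to deserialize
    }
}
```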
(as a side note - the size of the page cache is actually a high-watermark, which is one of the reasons that it was important to commit() even if you had transactions turned off - even after you commit() the transaction, the total number of pages used in that transaction will continue to be in memory).

Some other comments: I'm actually not at all concerned about the "wasted" space in the current logical record id. The amount of addressable space in the file shrinks as you add more logical row ids (i.e. the translation table grows). This means that you'd have to have a file filled with 8 byte objects in order to run out of logical row ids before you run out of addressable file space (to say nothing of the fact that you'd have run out of physical drive space eons before that - as a fun exercise, calculate how long it would take to write 2^63 bytes at modern disk bus speeds... even at 100 MB/s it works out to nearly three thousand years, so I think you get the picture :-) ). I was actually quite happy that the wasted space was there, because it meant I could be tricky and shrink the reverse lookup value in the record header so we don't clobber disk space in order to have reverse lookup.

Enough for now - lots of stuff to think about. - K

> Kevin, A while back I posted on the notion of "logical containers." I don't believe that I ever cleared up the terminology to your satisfaction. However, the recent thread on the translation pages has given me some insight as to how to use such containers so as to consume those 10 remaining bits in the recid in the current translation page scheme. I'll sketch out the approach that I am suggesting.
It has the advantage that it packs many logical records (from the application perspective) inside of a single physical record (from the jdbm perspective).

<snip>

-------------------------------------------------------
This SF.Net email is sponsored by: Power Architecture Resource Center: Free content, downloads, discussions, and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
Jdbm-developer mailing list
Jdb...@li...
https://lists.sourceforge.net/lists/listinfo/jdbm-developer