From: Kevin D. <ke...@tr...> - 2005-09-27 17:50:09
Markus-

Good to hear from you.

I'll reply to each of your questions/comments in turn (please pardon my stream of thought on some of this - I'm designing as I go here...):

Binary format having to change:
-------------------------------------
Yup - completely agreed. I have spent a lot of time thinking about this, and I really don't see any decent alternative. We could store the reverse lookup table in a separate part of the database, and just populate it by walking the logical row id list - that makes the migration easy - but it is considerably less than optimal.

For cases where upgrading the file format is necessary, we may have to use an approach of reading records from an old file and writing them to the new format. This is definitely a tricky aspect of things, but I think it's something that we do have to address.


Just curious, where does this limit (maximum block id) exactly come from?
-------------------------------------
Technically, it is hard coded in the Magic class - but the theoretical foundation of it is in allowing storage of a physical location in a single long. 2 bytes (16 bits) are reserved for the offset in the block, and a primitive long has 8 bytes (64 bits). Of those bits, 1 bit is used for sign, so we really have 63 bits. Subtracting 16 bits from 63 gives 47, so the biggest block number is 2^47.


And one other important thing to remember is that the logical ids are not assigned very "dense"
-------------------------------------
Actually, in my new translation page strategy, they will be very dense. For a brand new file, it will be pretty much perfectly dense. For transitional files (where we have to hold on to the old logical row ids), it will be sparsely populated, but clumped. The translation page needs to be overhauled anyway to allow it to move around. I can provide details on the new design I'm thinking about if you'd be interested in collaborating on the theoretical design (obviously, I'll accept offers of help for the coding/testing as well, but at this point I think it's more important to work out the theoretical design and make sure there aren't any fatal flaws).


This proves that the current file structure is pretty scalable for future use, and if you ask me, there should not be changed anything about it.
-------------------------------------
I agree. However, given the limitations of the current addressing scheme, we can reduce the address space of the logical row ids without reducing the current addressable size of the design. I am inclined to leave the current addressable range alone, but reduce the potential address space of the logical rowid so it can be captured in 7 bytes instead of 8. It will still be managed as a long in memory, but on disk, in the record header, it will be 7 bytes.
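Just to make the bit arithmetic concrete, here is a rough Java sketch of what I'm describing (my own names and helpers, not the actual Magic/Location code - treat it purely as an illustration):

// Rough sketch only - shows the arithmetic described above, nothing more.
public final class LocationSketch {

    private static final int  OFFSET_BITS = 16;                    // 2 bytes for the in-block offset
    private static final long MAX_BLOCK   = (1L << 47) - 1;        // 63 usable bits - 16 = 47
    private static final int  MAX_OFFSET  = (1 << OFFSET_BITS) - 1;

    /** Pack a block id and an in-block offset into a single positive long. */
    static long pack(long block, int offset) {
        if (block < 0 || block > MAX_BLOCK || offset < 0 || offset > MAX_OFFSET) {
            throw new IllegalArgumentException("block/offset out of range");
        }
        return (block << OFFSET_BITS) | offset;
    }

    static long block(long packed)  { return packed >>> OFFSET_BITS; }
    static int  offset(long packed) { return (int) (packed & MAX_OFFSET); }

    /** Write a logical row id into a record header as 7 bytes (high byte first). */
    static void writeRowId7(byte[] header, int pos, long rowId) {
        for (int i = 6; i >= 0; i--) {
            header[pos + i] = (byte) rowId;
            rowId >>>= 8;
        }
    }

    /** Read a 7-byte logical row id from a record header back into a long. */
    static long readRowId7(byte[] header, int pos) {
        long value = 0;
        for (int i = 0; i < 7; i++) {
            value = (value << 8) | (header[pos + i] & 0xFF);
        }
        return value;
    }
}

The 7-byte on-disk form simply drops the unused high byte of the in-memory long, which works as long as we cap the logical address space as discussed above.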
The new translation page design has no such restriction (it can actually be extended relatively easily to allow for an infinite number of logical row ids, so if we wind up with a longlong primitive in Java v9 we can take advantage of it :-) ).


One problem during defragging could be the assignment of logical row ids. If I am not totally mistaken, they're the position in a list of the physical positions of all records.
-------------------------------------
Yup - thus the new design of the translation page system.


We use logical ids to identify a record. Because records tend to be physically moved through the database, there's used a mapping id->physical position (that are the translation pages used for, right?).
-------------------------------------
Interestingly, this is what translation pages *can* be used for. In the current implementation of jdbm, physical records are never moved. Having the translation page allows for this kind of thing, but it was never implemented. The current discussion is centered on implementing this (along with a splitting and coalescing allocator).


Now in order to be able to perform a O(1) fetching of a given record, we need to be able to directly jump at the position in this list where the physical offset is stored (i.e. without traversing this list). Only then large structures (or algorithms) built on a RecordManager (BTree etc.) really have the complexity they should have. If O(1) retrieval is not possible, then the complexity of a BTree changes from O(log n) to something like O(n*log n).
As I guess, this is currently achieved via the assignment of logical ids: It consists of the block number and the offset in this block encoded into one long. Thus the block where the ids are stored can be fetched with only one read operation/seek and the record (if it's not too large) can always be fetched with two read operations/seeks. This allows for O(1) access.
This implies of course that a translation page/block is never moved. So if you're planning to change this, what about the complexity of a record fetch? (I think I should read your other posts again... perhaps I understand them better now).
-------------------------------------
No question about it - a revised, movable translation page will require a more complex lookup mechanism. I am a bit weak on the theoretical aspect of data structure performance (it's been too long, and my background is in fluid mechanics, so my experience with calculation performance is more geared to numerical methods and differential equations...). What I can say is that even with a huge database (with all 55 bits worth of logical rows allocated - i.e. average record size is 8 bytes), the number of page reads for a lookup is 7. However, given that those pages may be scattered all over the file, it could be a problem with cache misses, etc...
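For comparison, the two-seek lookup you describe for the current scheme would look roughly like this (a simplified sketch - the names, the 8192-byte block size, and passing the record length explicitly are my assumptions, not how the real RecordFile/translation code is factored):

import java.io.IOException;
import java.io.RandomAccessFile;

// Illustration of the current O(1) lookup: one seek into the translation
// block named by the logical row id, a second seek into the data block.
class TranslationLookupSketch {

    static final int BLOCK_SIZE = 8192;               // assumed for the sketch

    private final RandomAccessFile file;

    TranslationLookupSketch(RandomAccessFile file) { this.file = file; }

    static long blockOf(long packed)  { return packed >>> 16; }
    static int  offsetOf(long packed) { return (int) (packed & 0xFFFF); }

    /** Seek #1: jump to the translation block encoded in the logical row id
        and read the 8-byte physical location stored at the given offset. */
    long physicalLocation(long logicalRowId) throws IOException {
        long transBlock  = blockOf(logicalRowId);
        int  transOffset = offsetOf(logicalRowId);
        file.seek(transBlock * BLOCK_SIZE + transOffset);
        return file.readLong();
    }

    /** Seek #2: jump to the data block and read the record bytes.  (In the
        real file the length comes from the record header; here it's passed in.) */
    byte[] fetch(long logicalRowId, int recordLength) throws IOException {
        long physical = physicalLocation(logicalRowId);
        file.seek(blockOf(physical) * BLOCK_SIZE + offsetOf(physical));
        byte[] data = new byte[recordLength];
        file.readFully(data);
        return data;
    }
}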
I'm not entirely sure if this multi-read approach is a *really bad idea* or a decent design tradeoff - I am extremely open to suggestions.


Another strategy here (but one that would not be backwards compatible with the current logical row ids) is to always force new translation pages to the front of the jdbm file (immediately following the file header page), even if it means moving another page type out of the way... The logical row id would then be a value from which the page number and offset could be calculated directly. That would provide O(1) performance no matter what size the database is. That is actually quite appealing, if we are OK with completely breaking binary compatibility.


Another option would be to combine the two approaches. If the logical row id ends with a 0 bit (which all current ones do), then it gets looked up in the old translation page system. If the logical row id ends with a 1 bit, we use the new translation page system (which packs translation pages towards the front of the file so they don't have to be moved around). We then basically say that translation pages aren't allowed to be moved, no matter what... That would give us backwards file format compatibility, and keep us from running into problems with needing to move translation pages around to achieve file packing.
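To be a bit more explicit about what I'm picturing with the combined approach (purely hypothetical - none of this exists yet, and the 8-byte slot size is just an assumption):

import java.io.IOException;

// Sketch of the combined idea: the low bit of a logical row id selects the
// translation mechanism, so existing ids keep working unchanged.
class CombinedTranslationSketch {

    /** Whatever resolves a logical row id to a packed physical location. */
    interface Translator {
        long physicalLocation(long logicalRowId) throws IOException;
    }

    private final Translator oldPages;    // existing translation page lookup
    private final Translator frontPages;  // new pages pinned right behind the file header

    CombinedTranslationSketch(Translator oldPages, Translator frontPages) {
        this.oldPages = oldPages;
        this.frontPages = frontPages;
    }

    long resolve(long logicalRowId) throws IOException {
        if ((logicalRowId & 1L) == 0) {
            // Low bit 0: an id issued by the current system - old translation pages.
            return oldPages.physicalLocation(logicalRowId);
        }
        // Low bit 1: new-style id.  Strip the tag bit; what remains is a slot
        // index that maps straight to a page number and offset, because the new
        // translation pages are packed at the front of the file and never move.
        return frontPages.physicalLocation(logicalRowId >>> 1);
    }

    /** Direct page/offset calculation for a front-packed slot, assuming 8-byte
        slots and translation pages starting right after the file header page. */
    static long pageOf(long slot, int slotsPerPage)       { return 1 + slot / slotsPerPage; }
    static int  offsetInPage(long slot, int slotsPerPage) { return (int) (slot % slotsPerPage) * 8; }
}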
Regardless, we will want to re-implement the way that the free logical row ids are handled (if we keep the direct lookup strategy, I'm thinking that available slots would be tracked as a linked list - with the sparse tree strategy I'm thinking about, free slots are managed by internal binary priority queues, so it's not an issue). The current system is not at all efficient.


Cheers!

- K



> Hi there

Kevin Day wrote:
> [reverse lookup capability]
>
> So, this got me thinking about the record header...

The binary format of a jdbm file will have to change obviously. Then the question arises how old jdbm files can be converted to the new format. What are your plans concerning conversion of old databases? (If there are any ;)

> Let's say that we have a file completely full of records with an average
> of 8 bytes each (I think this is a very conservative average). If we had
> only data pages in the file, and we honor the current jdbm limit on the
> maximum block id (2^47), we will need only 7 bytes to capture all
> possible logical row ids. So, we can actually drop the new header size
> down to 12 bytes. Just in case anyone cares, the average data size
> would have to be 4096 bytes before we could recover an additional byte.

Just curious, where does this limit (maximum block id) exactly come from?
And one other important thing to remember is that the logical ids are not assigned very "dense", i.e. they usually start at some value around 13k and are increased by about 10 for each record. This means that there are some bits wasted. I don't know if you took this into account, but my approach would be more pragmatic: A logical id is a long. Period. ;)
(Actually it's just a positive long, so the first bit could be used for other purposes, but that's it in my eyes.)

> Another way to approach this is to think about the biggest size that we
> could reasonably see being used (the addressing above allows for file
> sizes up to ~8,000,000 TB - yup 8 million Terabytes). This seems quite
> excessive, but I'm not sure where the reasonable break point should be -
> I'm sure that 20 MB used to sound like a lot...

This proves that the current file structure is pretty scalable for future use, and if you ask me, there should not be changed anything about it.

> Anyway, I think that implementing an effective defrag system is going to
> require this kind of reverse pointer - I'd love to hear if any of you
> have any other ideas.

One problem during defragging could be the assignment of logical row ids. If I am not totally mistaken, they're the position in a list of the physical positions of all records.
Let me sum up how I understood the source code till now (I did not look pretty deep into it, most of this are guesses).
We use logical ids to identify a record. Because records tend to be physically moved through the database, there's used a mapping id->physical position (that are the translation pages used for, right?).
Now in order to be able to perform a O(1) fetching of a given record, we need to be able to directly jump at the position in this list where the physical offset is stored (i.e. without traversing this list). Only then large structures (or algorithms) built on a RecordManager (BTree etc.) really have the complexity they should have. If O(1) retrieval is not possible, then the complexity of a BTree changes from O(log n) to something like O(n*log n).
As I guess, this is currently achieved via the assignment of logical ids: It consists of the block number and the offset in this block encoded into one long. Thus the block where the ids are stored can be fetched with only one read operation/seek and the record (if it's not too large) can always be fetched with two read operations/seeks. This allows for O(1) access.
This implies of course that a translation page/block is never moved. So if you're planning to change this, what about the complexity of a record fetch? (I think I should read your other posts again... perhaps I understand them better now).


One tiny thing to Bryan: If you only "answer" to a message (e.g. from Kevin), then your answer is not posted to the list, thus we can't read them. (I would blame the sourceforge list software for this, as it should add a Reply-To header so that answers are sent to the list and not to the author of the original post.) It's the same here with my mail client. I have to use "answer to all" and then manually remove the author of the original post in the address field of my mail client.

Bye
-Markus