From: Kevin D. <ke...@tr...> - 2005-09-27 17:50:09
Markus-

Good to hear from you.

I'll reply to each of your questions/comments in turn (please pardon my stream of thought on some of this - I'm designing as I go here...):

Binary format having to change:
-------------------------------------
Yup - completely agreed. I have spent a lot of time thinking about this, and I really don't see any decent alternative. We could store the reverse lookup table in a separate part of the database, and just populate it by walking the logical row id list - that makes the migration easy - but it is considerably less than optimal.

For cases where upgrading the file format is necessary, we may have to use an approach of reading records from an old file and writing them to the new format. This is definitely a tricky aspect of things, but I think it's something that we do have to address.


Just curious, where does this limit (maximum block id) exactly come from?
-------------------------------------
Technically, it is hard coded in the Magic class - but the theoretical foundation of it is in allowing storage of a physical location in a single long. 2 bytes (16 bits) are reserved for the offset in the block, and a primitive long has 8 bytes (64 bits). Of those bits, 1 bit is used for sign, so we really have 63 bits. Subtracting 16 bits from 63 gives 47, so the biggest block number is 2^47.


And one other important thing to remember is that the logical ids are not assigned very "dense"
-------------------------------------
Actually, in my new translation page strategy, they will be very dense. For a brand new file, it will be pretty much perfectly dense. For transitional files (where we have to hold on to the old logical row ids), it will be sparsely populated, but clumped. The translation page needs to be overhauled anyway to allow it to move around. I can provide details on the new design I'm thinking about if you'd be interested in collaborating on the theoretical design (obviously, I'll accept offers of help for the coding/testing as well, but at this point I think it's more important to work out the theoretical design and make sure there aren't any fatal flaws).


This proves that the current file structure is pretty scalable for future use, and if you ask me, there should not be changed anything about it.
-------------------------------------
I agree. However, given the limitations of the current addressing scheme, we can reduce the address space of the logical row ids without reducing the current addressable size of the design. I am inclined to leave the current addressable range alone, but reduce the potential address space of the logical rowid so it can be captured in 7 bytes instead of 8. It will still be managed as a long in memory, but on disk, in the record header, it will be 7 bytes.
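Just to make the bit arithmetic concrete, here is a rough Java sketch of what I'm describing (my own names and helpers, not the actual Magic/Location code - treat it purely as an illustration):

// Rough sketch only - shows the arithmetic described above, nothing more.
public final class LocationSketch {

    private static final int  OFFSET_BITS = 16;                    // 2 bytes for the in-block offset
    private static final long MAX_BLOCK   = (1L << 47) - 1;        // 63 usable bits - 16 = 47
    private static final int  MAX_OFFSET  = (1 << OFFSET_BITS) - 1;

    /** Pack a block id and an in-block offset into a single positive long. */
    static long pack(long block, int offset) {
        if (block < 0 || block > MAX_BLOCK || offset < 0 || offset > MAX_OFFSET) {
            throw new IllegalArgumentException("block/offset out of range");
        }
        return (block << OFFSET_BITS) | offset;
    }

    static long block(long packed)  { return packed >>> OFFSET_BITS; }
    static int  offset(long packed) { return (int) (packed & MAX_OFFSET); }

    /** Write a logical row id into a record header as 7 bytes (high byte first). */
    static void writeRowId7(byte[] header, int pos, long rowId) {
        for (int i = 6; i >= 0; i--) {
            header[pos + i] = (byte) rowId;
            rowId >>>= 8;
        }
    }

    /** Read a 7-byte logical row id from a record header back into a long. */
    static long readRowId7(byte[] header, int pos) {
        long value = 0;
        for (int i = 0; i < 7; i++) {
            value = (value << 8) | (header[pos + i] & 0xFF);
        }
        return value;
    }
}

The 7-byte on-disk form simply drops the unused high byte of the in-memory long, which works as long as we cap the logical address space as discussed above.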
The new translation page design has no such restriction (it can actually be extended relatively easily to allow for an infinite number of logical row ids, so if we wind up with a longlong primitive in Java v9 we can take advantage of it :-) ).


One problem during defragging could be the assignment of logical row ids. If I am not totally mistaken, they're the position in a list of the physical positions of all records.
-------------------------------------
Yup - thus the new design of the translation page system.


We use logical ids to identify a record. Because records tend to be physically moved through the database, there's used a mapping id->physical position (that are the translation pages used for, right?).
-------------------------------------
Interestingly, this is what translation pages *can* be used for. In the current implementation of jdbm, physical records are never moved. Having the translation page allows for this kind of thing, but it was never implemented. The current discussion is centered on implementing this (along with a splitting and coalescing allocator).


Now in order to be able to perform a O(1) fetching of a given record, we need to be able to directly jump at the position in this list where the physical offset is stored (i.e. without traversing this list). Only then large structures (or algorithms) built on a RecordManager (BTree etc.) really have the complexity they should have. If O(1) retrieval is not possible, then the complexity of a BTree changes from O(log n) to something like O(n*log n).
As I guess, this is currently achieved via the assignment of logical ids: It consists of the block number and the offset in this block encoded into one long. Thus the block where the ids are stored can be fetched with only one read operation/seek and the record (if it's not too large) can always be fetched with two read operations/seeks. This allows for O(1) access.
This implies of course that a translation page/block is never moved. So if you're planning to change this, what about the complexity of a record fetch? (I think I should read your other posts again... perhaps I understand them better now).
-------------------------------------
No question about it - a revised, movable translation page will require a more complex lookup mechanism. I am a bit weak on the theoretical aspect of data structure performance (it's been too long, and my background is in fluid mechanics, so my experience with calculation performance is more geared to numerical methods and differential equations...). What I can say is that even with a huge database (with all 55 bits worth of logical rows allocated - i.e. average record size is 8 bytes), the number of page reads for a lookup is 7. However, given that those pages may be scattered all over the file, it could be a problem with cache misses, etc...
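For comparison, the two-seek lookup you describe for the current scheme would look roughly like this (a simplified sketch - the names, the 8192-byte block size, and passing the record length explicitly are my assumptions, not how the real RecordFile/translation code is factored):

import java.io.IOException;
import java.io.RandomAccessFile;

// Illustration of the current O(1) lookup: one seek into the translation
// block named by the logical row id, a second seek into the data block.
class TranslationLookupSketch {

    static final int BLOCK_SIZE = 8192;               // assumed for the sketch

    private final RandomAccessFile file;

    TranslationLookupSketch(RandomAccessFile file) { this.file = file; }

    static long blockOf(long packed)  { return packed >>> 16; }
    static int  offsetOf(long packed) { return (int) (packed & 0xFFFF); }

    /** Seek #1: jump to the translation block encoded in the logical row id
        and read the 8-byte physical location stored at the given offset. */
    long physicalLocation(long logicalRowId) throws IOException {
        long transBlock  = blockOf(logicalRowId);
        int  transOffset = offsetOf(logicalRowId);
        file.seek(transBlock * BLOCK_SIZE + transOffset);
        return file.readLong();
    }

    /** Seek #2: jump to the data block and read the record bytes.  (In the
        real file the length comes from the record header; here it's passed in.) */
    byte[] fetch(long logicalRowId, int recordLength) throws IOException {
        long physical = physicalLocation(logicalRowId);
        file.seek(blockOf(physical) * BLOCK_SIZE + offsetOf(physical));
        byte[] data = new byte[recordLength];
        file.readFully(data);
        return data;
    }
}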
I'm not entirely sure if this multi-read approach is a *really bad idea* or a decent design tradeoff - I am extremely open to suggestions.


Another strategy here (but one that would not be backwards compatible with the current logical row ids) is to always force new translation pages to the front of the jdbm file (immediately following the file header page), even if it means moving another page type out of the way... The logical row id would then be a value from which the page number and offset could be calculated directly. That would provide O(1) performance no matter what size the database is. That is actually quite appealing, if we are OK with completely breaking binary compatibility.


Another option would be to combine the two approaches. If the logical row id ends with a 0 bit (which all current ones do), then it gets looked up in the old translation page system. If the logical row id ends with a 1 bit, we use the new translation page system (which packs translation pages towards the front of the file so they don't have to be moved around). We then basically say that translation pages aren't allowed to be moved, no matter what... That would give us backwards file format compatibility, and keep us from running into problems with needing to move translation pages around to achieve file packing.
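To be a bit more explicit about what I'm picturing with the combined approach (purely hypothetical - none of this exists yet, and the 8-byte slot size is just an assumption):

import java.io.IOException;

// Sketch of the combined idea: the low bit of a logical row id selects the
// translation mechanism, so existing ids keep working unchanged.
class CombinedTranslationSketch {

    /** Whatever resolves a logical row id to a packed physical location. */
    interface Translator {
        long physicalLocation(long logicalRowId) throws IOException;
    }

    private final Translator oldPages;    // existing translation page lookup
    private final Translator frontPages;  // new pages pinned right behind the file header

    CombinedTranslationSketch(Translator oldPages, Translator frontPages) {
        this.oldPages = oldPages;
        this.frontPages = frontPages;
    }

    long resolve(long logicalRowId) throws IOException {
        if ((logicalRowId & 1L) == 0) {
            // Low bit 0: an id issued by the current system - old translation pages.
            return oldPages.physicalLocation(logicalRowId);
        }
        // Low bit 1: new-style id.  Strip the tag bit; what remains is a slot
        // index that maps straight to a page number and offset, because the new
        // translation pages are packed at the front of the file and never move.
        return frontPages.physicalLocation(logicalRowId >>> 1);
    }

    /** Direct page/offset calculation for a front-packed slot, assuming 8-byte
        slots and translation pages starting right after the file header page. */
    static long pageOf(long slot, int slotsPerPage)       { return 1 + slot / slotsPerPage; }
    static int  offsetInPage(long slot, int slotsPerPage) { return (int) (slot % slotsPerPage) * 8; }
}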
Regardless, we will want to re-implement the way that the free logical row ids are handled (if we keep the direct lookup strategy, I'm thinking that available slots would be tracked as a linked list - with the sparse tree strategy I'm thinking about, free slots are managed by internal binary priority queues, so it's not an issue). The current system is not at all efficient.


Cheers!

- K



> Hi there

Kevin Day wrote:
> [reverse lookup capability]
>
> So, this got me thinking about the record header...

The binary format of a jdbm file will have to change obviously. Then the question arises how old jdbm files can be converted to the new format. What are your plans concerning conversion of old databases? (If there are any ;)

> Let's say that we have a file completely full of records with an average
> of 8 bytes each (I think this is a very conservative average). If we had
> only data pages in the file, and we honor the current jdbm limit on the
> maximum block id (2^47), we will need only 7 bytes to capture all
> possible logical row ids. So, we can actually drop the new header size
> down to 12 bytes. Just in case anyone cares, the average data size
> would have to be 4096 bytes before we could recover an additional byte.

Just curious, where does this limit (maximum block id) exactly come from?
And one other important thing to remember is that the logical ids are not assigned very "dense", i.e. they usually start at some value around 13k and are increased by about 10 for each record. This means that there are some bits wasted. I don't know if you took this into account, but my approach would be more pragmatic: A logical id is a long. Period. ;)
(Actually it's just a positive long, so the first bit could be used for other purposes, but that's it in my eyes.)

> Another way to approach this is to think about the biggest size that we
> could reasonably see being used (the addressing above allows for file
> sizes up to ~8,000,000 TB - yup 8 million Terabytes). This seems quite
> excessive, but I'm not sure where the reasonable break point should be -
> I'm sure that 20 MB used to sound like a lot...

This proves that the current file structure is pretty scalable for future use, and if you ask me, there should not be changed anything about it.

> Anyway, I think that implementing an effective defrag system is going to
> require this kind of reverse pointer - I'd love to hear if any of you
> have any other ideas.

One problem during defragging could be the assignment of logical row ids. If I am not totally mistaken, they're the position in a list of the physical positions of all records.
Let me sum up how I understood the source code till now (I did not look pretty deep into it, most of this are guesses).
We use logical ids to identify a record. Because records tend to be physically moved through the database, there's used a mapping id->physical position (that are the translation pages used for, right?).
Now in order to be able to perform a O(1) fetching of a given record, we need to be able to directly jump at the position in this list where the physical offset is stored (i.e. without traversing this list). Only then large structures (or algorithms) built on a RecordManager (BTree etc.) really have the complexity they should have. If O(1) retrieval is not possible, then the complexity of a BTree changes from O(log n) to something like O(n*log n).
As I guess, this is currently achieved via the assignment of logical ids: It consists of the block number and the offset in this block encoded into one long. Thus the block where the ids are stored can be fetched with only one read operation/seek and the record (if it's not too large) can always be fetched with two read operations/seeks. This allows for O(1) access.
This implies of course that a translation page/block is never moved. So if you're planning to change this, what about the complexity of a record fetch? (I think I should read your other posts again... perhaps I understand them better now).


One tiny thing to Bryan: If you only "answer" to a message (e.g. from Kevin), then your answer is not posted to the list, thus we can't read them. (I would blame the sourceforge list software for this, as it should add a Reply-To header so that answers are sent to the list and not to the author of the original post.) It's the same here with my mail client. I have to use "answer to all" and then manually remove the author of the original post in the address field of my mail client.

Bye
-Markus