From: Kevin D. <ke...@tr...> - 2005-09-23 17:56:00
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML><HEAD> <STYLE type=text/css> P, UL, OL, DL, DIR, MENU, PRE { margin: 0 auto;}</STYLE> <META content="MSHTML 6.00.2900.2668" name=GENERATOR></HEAD> <BODY leftMargin=1 topMargin=1 rightMargin=1><FONT face=Tahoma size=2> <DIV>Bryan-<BR></DIV> <DIV>"If a data page is solely used as a continuation page for a record, does it show up in the used pages list? "</DIV> <DIV> </DIV> <DIV>KD - There is no used pages list, per se - just the linked list that actually contains the page. But, to answer the question: Sure. It's a used page so it is in the linked list (pointed to by the prior page).</DIV> <DIV> </DIV> <DIV> </DIV> <DIV>"If so, what is the return value for getFirst() on such a data page since there is no record header?"</DIV> <DIV> </DIV> <DIV>KD - It returns 0. Each page has a header, so it is impossible for actual data to reside at position 0.</DIV> <DIV> </DIV> <DIV> </DIV> <DIV>""interesting" information to extract from the free page list"</DIV> <DIV> </DIV> <DIV>KD - The data that I'm interested in is:</DIV> <DIV> </DIV> <DIV>a) The number of pages in the entire free page list</DIV> <DIV>b) Total bytes in the free page list</DIV> <DIV>c) Total DATA bytes in the free page list (i.e. total bytes less header bytes)</DIV> <DIV>d) Number of free pages at the end of the file</DIV> <DIV> </DIV> <DIV> </DIV> <DIV>One general comment: Having free pages scattered around the file is not a bad thing. The file system is doing this anyway, so any concern about whether two blocks that are adjacent in the list are also adjacent on disk isn't going to be worth it. Once you are reading data in pages, the actual location of those pages becomes much less important.</DIV> <DIV> </DIV> <DIV>The only time that it actually does matter is if we are trying to shrink the size of the database file. 
If we do go there (and I'm not entirely certain that it is really necessary, given most usage patterns), then pages can be moved physically towards the front of the file without changing their logical location in the linked list. For the pages in the free list, this can be done easily. For the other page types, we'd have to make sure that we properly update the translation table and free lists...</DIV> <DIV> </DIV> <DIV> </DIV> <DIV>Cheers!</DIV> <DIV> </DIV> <DIV>- K</DIV> <DIV> </DIV> <DIV><BR>-------------------<BR> > Kevin, <BR><BR>Two questions related to the traversal of the "used pages" list. If a data page is solely used as a continuation page <BR>for a record, does it show up in the used pages list? If so, what is the return value for getFirst() on such a data page <BR>since there is no record header? <BR><BR>Also, do you have thoughts on what would be "interesting" information to extract from the free page list? I am thinking <BR>of a free page count and perhaps determining how much fragmentation there is in the free page list. Which gets back <BR>to an earlier question -- does jdbm memory allocation make any effort to allocate contiguous pages for large records? <BR><BR>-bryan <BR><BR>-----Original Message-----<BR>From: <A href="mailto:jdb...@li..."><FONT color=#0000ff>jdb...@li...</FONT></A> <A href="mailto:jdb...@li..."><FONT color=#0000ff>[mailto:jdb...@li...]</FONT></A> On Behalf Of Thompson, Bryan B.<BR>Sent: Friday, September 23, 2005 8:15 AM<BR>To: Kevin Day<BR>Cc: <A href="mailto:jdb...@li..."><FONT color=#0000ff>jdb...@li...</FONT></A><BR>Subject: RE: [Jdbm-developer] commit: jdbm.recman.DumpUtility<BR><BR><BR>I'll try this today. One thing that I am not clear about with jdbm is whether it makes any attempt (or guarantee) <BR>that records which span a page will be on contiguous pages. <BR><BR>As I mentioned, I have been playing with some blob/clob support. My original take was to have the records form <BR>a linked list. 
Since this is exactly what the jdbm record headers are doing, it should be possible to incrementally <BR>allocate new pages into a jdbm record, thereby supporting streaming from the application in addition to the current <BR>Serializer approach. An interesting thought. <BR><BR>However, it occurred to me that greater efficiency could be obtained by blocking the linked list of records (in my <BR>current implementation of blob/clob) into an array of recids in a header record for the blob. That would make it <BR>possible to do pre-fetch strategies for subsequent segments of the blob. If we use the record header approach as <BR>it stands today, the most pre-fetch that we could do is one page read ahead at a time. <BR><BR>You mentioned I/O efficiency in another thread. The guiding principle as I understand it is that you want to get as <BR>much I/O concurrency as possible so that you can get as many disk arms behind your application as possible. <BR>This leads to the use of striped disk arrays and cluster storage solutions. jdbm today is single threaded for read <BR>and write, so it is not possible to get I/O concurrency. Even if you have I/O concurrency at the store layer, your <BR>application has to support it as well. For us, I/O concurrency is gained by high level query languages that can <BR>use parallel read operations against the store or (in a different application which is a fuzzy inference engine based <BR>on a neural network model) by being able to model the main computation using a parallel processing approach. <BR><BR>One change that I would like to introduce (if we wind up introducing changes in the jdbm file structure) is a long <BR>class identifier in the record header. This would make it possible to accurately profile the contents of a jdbm <BR>store. For certain key classes (BTree, BPage, HashDictionary, HashBucket, String, Long) we would have "magic" <BR>pre-defined values that used negative recids. 
All other classes would be assigned recids by interning the class name <BR>in a string table. The string table itself could be just a BTree using a compressed key index whose keys are the <BR>string values and whose values are the jdbm records whose content is that string. The latter is necessary so that <BR>you can look up the class name. <BR><BR>I am certainly going to do this at the object manager layer for our application since some use of Externalizable or <BR>even Serializable appears to be why the store is so bloated (I had no idea just how bad Java serialization was!). I <BR>think that it would be a nice feature for jdbm, but not one that we could introduce while maintaining binary compatibility <BR>in the store file. <BR><BR>At the same time, I would also like to introduce version numbers to critical classes (BTree, BPage, etc.) so that we <BR>have more flexibility in evolving jdbm without breaking binary compatibility. <BR><BR>-bryan <BR><BR>-----Original Message-----<BR>From: Kevin Day <A href="mailto:ke...@tr..."><FONT color=#0000ff>[mailto:ke...@tr...]</FONT></A> <BR>Sent: Thursday, September 22, 2005 11:26 PM<BR>To: Thompson, Bryan B.<BR>Subject: re: [Jdbm-developer] commit: jdbm.recman.DumpUtility<BR><BR><BR>Bryan- <BR><BR>Thanks for picking this up right now - I am so slammed with work that I'm not able to really write any jdbm code. For some reason it's a lot easier for me to think of conceptual design and algorithms in the evening than actually bang on the keyboard. <BR><BR>Things should clear up a bit in the next week or so. Please let me know if you need a hand walking the data pages - I think that is going to be your best bet for getting a solid unused space percentage. <BR><BR>- K <BR><BR> <BR>>I finally got through the sourceforge CVS. I will pick up work on the free lists tomorrow. -bryan <<BR><</DIV></FONT></BODY></HTML> |