From: Thompson, B. B. <BRY...@sa...> - 2005-09-23 16:32:08
FYI - I've committed the updated version of jdbm.recman.DumpUtility and also README-alloc.txt on the memory management mechanisms.

-b

-----Original Message-----
From: Thompson, Bryan B.
Sent: Friday, September 23, 2005 10:21 AM
To: Thompson, Bryan B.; 'Kevin Day'
Cc: 'jdb...@li...'
Subject: RE: [Jdbm-developer] commit: jdbm.recman.DumpUtility

One more question: how can we correctly detect when there are no more records on a page? What I am seeing is that the rest of the page is filled with zeros, so it appears as records with zero size and zero capacity. However, the dump could fail if this is not always the case.

-bryan

-----Original Message-----
From: Thompson, Bryan B.
Sent: Friday, September 23, 2005 10:11 AM
To: Thompson, Bryan B.; Kevin Day
Cc: jdb...@li...
Subject: RE: [Jdbm-developer] commit: jdbm.recman.DumpUtility

Kevin,

Two questions related to the traversal of the "used pages" list. If a data page is used solely as a continuation page for a record, does it show up in the used pages list? If so, what is the return value of getFirst() on such a data page, since there is no record header?

Also, do you have thoughts on what would be "interesting" information to extract from the free page list? I am thinking of a free page count and perhaps a measure of how much fragmentation there is in the free page list. Which gets back to an earlier question: does jdbm memory allocation make any effort to allocate contiguous pages for large records?

-bryan

-----Original Message-----
From: jdb...@li... [mailto:jdb...@li...] On Behalf Of Thompson, Bryan B.
Sent: Friday, September 23, 2005 8:15 AM
To: Kevin Day
Cc: jdb...@li...
Subject: RE: [Jdbm-developer] commit: jdbm.recman.DumpUtility

I'll try this today. One thing that I am not clear about with jdbm is whether it makes any attempt (or guarantee) that records which span a page will be on contiguous pages. As I mentioned, I have been playing with some blob/clob support.
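The zero-fill observation in the 10:21 AM message suggests one termination test for the dump: stop scanning when a header with zero size and zero capacity is encountered. A minimal sketch of that idea, assuming a simplified two-int header layout (PageScanSketch and its HEADER_SIZE are illustrative, not jdbm's actual RecordHeader format) and assuming the zero-fill behavior always holds, which the message itself notes is unverified:

```java
// Sketch: detecting the end of the record list on a data page.
// ASSUMPTION (from the thread): unused space on a page is zero-filled,
// so a header with zero size and zero capacity marks the end.
// The header layout (two 4-byte ints) is a simplification.
import java.nio.ByteBuffer;

public class PageScanSketch {

    static final int HEADER_SIZE = 8; // hypothetical: currentSize + availableSize

    /** Returns the byte offsets of the record headers found on the page. */
    public static java.util.List<Integer> scanRecordOffsets(byte[] page) {
        java.util.List<Integer> offsets = new java.util.ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(page);
        int pos = 0;
        while (pos + HEADER_SIZE <= page.length) {
            int currentSize = buf.getInt(pos);
            int availableSize = buf.getInt(pos + 4);
            // Zero size AND zero capacity: assume we hit zero-filled free space.
            if (currentSize == 0 && availableSize == 0) {
                break;
            }
            offsets.add(pos);
            pos += HEADER_SIZE + availableSize; // next header follows the slot
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] page = new byte[64];
        ByteBuffer buf = ByteBuffer.wrap(page);
        buf.putInt(0, 4); // one record: currentSize=4
        buf.putInt(4, 8); // availableSize=8, rest of page zero-filled
        System.out.println(scanRecordOffsets(page));
    }
}
```

If the zero-fill assumption ever fails, a scan like this would walk garbage headers, which is exactly the failure mode the message worries about; an explicit record count or end marker on the page would be more robust.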
My original take was to have the records form a linked list. Since this is exactly what the jdbm record headers are doing, it should be possible to incrementally allocate new pages into a jdbm record, thereby supporting streaming from the application in addition to the current Serializer approach. An interesting thought.

However, it occurred to me that greater efficiency could be obtained by blocking the linked list of records (in my current implementation of blob/clob) into an array of recids in a header record for the blob. That would make it possible to use pre-fetch strategies for subsequent segments of the blob. If we use the record header approach as it stands today, the most pre-fetch that we could do is one page of read-ahead at a time.

You mentioned I/O efficiency in another thread. The guiding principle, as I understand it, is that you want as much I/O concurrency as possible so that you can get as many disk arms behind your application as possible. This leads to the use of striped disk arrays and clustered storage solutions. jdbm today is single threaded for read and write, so it is not possible to get I/O concurrency. Even if you have I/O concurrency at the store layer, your application has to support it as well. For us, I/O concurrency is gained by high-level query languages that can use parallel read operations against the store, or (in a different application, a fuzzy inference engine based on a neural network model) by modeling the main computation using a parallel processing approach.

One change that I would like to introduce (if we wind up introducing changes in the jdbm file structure) is a long class identifier in the record header. This would make it possible to accurately profile the contents of a jdbm store. For certain key classes (BTree, BPage, HashDictionary, HashBucket, String, Long) we would have "magic" pre-defined values that used negative recids.
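The "magic" negative identifiers proposed above could be sketched as follows. The specific id values, the fully simple class names used as keys, and the ClassIds helper are all illustrative assumptions, not anything that exists in jdbm:

```java
// Sketch of the proposed class-identifier scheme: well-known classes get
// fixed negative ids ("magic" values), so a profiler could attribute
// record bytes to a class without deserializing the record.
// ASSUMPTION: the id values and key names here are invented for illustration.
import java.util.HashMap;
import java.util.Map;

public class ClassIds {

    // Hypothetical reserved ids; negative so they can never collide with
    // positive ids handed out by a string-table interning scheme.
    public static final long BTREE = -1L;
    public static final long BPAGE = -2L;
    public static final long HASH_DICTIONARY = -3L;
    public static final long HASH_BUCKET = -4L;
    public static final long STRING = -5L;
    public static final long LONG = -6L;

    private static final Map<String, Long> WELL_KNOWN = new HashMap<>();
    static {
        WELL_KNOWN.put("BTree", BTREE);
        WELL_KNOWN.put("BPage", BPAGE);
        WELL_KNOWN.put("HashDictionary", HASH_DICTIONARY);
        WELL_KNOWN.put("HashBucket", HASH_BUCKET);
        WELL_KNOWN.put("String", STRING);
        WELL_KNOWN.put("Long", LONG);
    }

    /** Returns the magic id, or 0 to signal "intern via the string table". */
    public static long idFor(String className) {
        Long id = WELL_KNOWN.get(className);
        return id == null ? 0L : id;
    }

    public static void main(String[] args) {
        System.out.println("BTree -> " + idFor("BTree"));
        System.out.println("com.example.Other -> " + idFor("com.example.Other"));
    }
}
```

Reserving the negative range for fixed ids keeps the common cases (BTree pages, strings) cheap to classify while leaving the entire positive range for interned application classes.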
All other classes would be assigned recids by interning the class name in a string table. The string table itself could be just a BTree using a compressed key index whose keys are the string values and whose values are the jdbm records whose content is that string. The latter is necessary so that you can look up the class name. I am certainly going to do this at the object manager layer for our application, since some use of Externalizable or even Serializable appears to be why the store is so bloated (I had no idea just how bad Java serialization was!). I think that it would be a nice feature for jdbm, but not one that we could introduce while maintaining binary compatibility in the store file.

At the same time, I would also like to introduce version numbers for critical classes (BTree, BPage, etc.) so that we have more flexibility in evolving jdbm without breaking binary compatibility.

-bryan

-----Original Message-----
From: Kevin Day [mailto:ke...@tr...]
Sent: Thursday, September 22, 2005 11:26 PM
To: Thompson, Bryan B.
Subject: re: [Jdbm-developer] commit: jdbm.recman.DumpUtility

Bryan-

Thanks for picking this up right now - I am completely slammed with work, so I'm not able to really write any jdbm code. For some reason it's a lot easier for me to think about conceptual design and algorithms in the evening than to actually bang on the keyboard. Things should clear up a bit in the next week or so. Please let me know if you need a hand walking the data pages - I think that is going to be your best bet for getting a solid unused-space percentage.

- K

> I finally got through the sourceforge CVS. I will pick up work on the free lists tomorrow. -bryan
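The interning scheme with reverse lookup described in the 8:15 AM message could be sketched as below. ClassNameTable and its in-memory maps are stand-ins for the proposed BTree-backed string table, not jdbm code; the reverse map plays the role of "the jdbm records whose content is that string", which is what lets an id found in a record header be turned back into a class name:

```java
// Sketch of class-name interning: a forward index (name -> id) plus a
// reverse lookup (id -> name). In the proposal the forward index would be
// a compressed-key BTree and the reverse lookup a record read; plain
// HashMaps stand in here. ASSUMPTION: all names here are illustrative.
import java.util.HashMap;
import java.util.Map;

public class ClassNameTable {

    private final Map<String, Long> nameToId = new HashMap<>();
    private final Map<Long, String> idToName = new HashMap<>();
    private long nextId = 1; // positive ids; negatives reserved for magics

    /** Interns a class name, returning its stable id. */
    public synchronized long intern(String className) {
        Long existing = nameToId.get(className);
        if (existing != null) {
            return existing;
        }
        long id = nextId++;
        nameToId.put(className, id);
        idToName.put(id, className);
        return id;
    }

    /** Reverse lookup: id back to class name, or null if unknown. */
    public synchronized String nameOf(long id) {
        return idToName.get(id);
    }

    public static void main(String[] args) {
        ClassNameTable table = new ClassNameTable();
        long id = table.intern("com.example.MyRecord");
        System.out.println(id + " -> " + table.nameOf(id));
    }
}
```

Interning the same name twice must return the same id, since the id is what gets written into record headers; that is why the forward index is consulted before a new id is assigned.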