From: 'Kevin D. ' <ke...@tr...> - 2005-10-03 17:02:57
Resend: The SF listserv did something goofy to my post... - K

Bryan-

>> for example I have objects that have a one-to-one relationship with a btree.

I think this might be a micro-optimization - the BTree object is a trivial read - all the data is actually stored in the BPages, and you definitely don't want to get into trying to put BPages into your containers (that would defeat the whole point of a BTree). I think that the practical uses of this type of container are not orthogonal - for example, I could make the reverse argument that I might want to store your containers inside a BTree (via a string key, for example).

I really think that it is appropriate to implement these containers at a level where they are independent of the record manager - this will give you considerably more freedom in your use of the containers (for example, you can optimize the slot count of each container page based on the amount of data it will contain), and will make sure that if we do need to make any changes to the logical row id structure, we haven't painted ourselves into a corner. I also think that this will make the container implementation better encapsulated, because you don't have to deal with promoting objects in the container itself. You could still make promoting of objects transparent to your application with a thin interface layer (what you are doing with the record manager), but by keeping the container itself as a higher level object, it becomes much more accessible to the general jdbm community.

For example, in my current application, I have objects that serialize into large byte arrays. If I were going to decide that I wanted to do the extra work to provide hints that certain of these objects should be loaded together, I would absolutely not want to use the full 10 bits. I might want to create a container that only holds 5 objects per page (so as not to wind up with a container that spans 30 or 40 disk pages).
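[Editorial sketch: a container built above the record manager could pick its slot count from the expected serialized object size. The names below are illustrative only - nothing here is actual jdbm code.]

```java
// Illustrative only - not jdbm code. A container implemented above the
// record manager can size its slot count from the expected serialized
// object size, aiming to keep one container record within a single 8K page.
class TunedContainer {
    static final int PAGE_SIZE = 1 << 13;  // 8192 bytes, jdbm's page size

    final int slotsPerPage;

    TunedContainer(int expectedObjectBytes) {
        // e.g. ~1600-byte objects give 5 slots; ~32-byte objects give 256
        this.slotsPerPage = Math.max(1, PAGE_SIZE / expectedObjectBytes);
    }
}
```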
I might want to create another container that could hold 500 objects per page for another type of object in my data store that serializes to a small byte array. I'm sure it would be possible to add more settings, etc... to an implementation that was closely tied to the record manager, but I think that mixing the layers like this results in a less than optimal design for a *general* container (I'm certain that doing it in the record manager is optimal for your particular application - but I think the cost of doing it outside the record manager is quite low, and that it would be better to have a general purpose container available to all users).

What do you think about calling this structure an "Associative Container" or a "Clumping Container" or something that is a bit clearer as to the purpose of the container? I'm looking forward to hearing your comments! - K

> Kevin, I've been thinking about this also. The role of the "container hint" - place X, Y, Z into the same container as A - is to suggest to the persistence layer that {A,X}, {A,Y}, {A,Z} are often co-materialized and to take whatever steps it can to facilitate that. One way to facilitate that is using a compound record - much like a node of a btree. Doing this efficiently requires encoding the offset within a container as part of the recid used by the application. Right now I am wrapping the cache/base record manager and shifting all recids N bits left to make room for an N bit address space within a container. Given what I know about the translation pages, it is clear that I have 10 bits for free, which is plenty. However if the logical recids became dense it would introduce a clear tradeoff between the address space within a container and the #of addressable records. I don't see any way around that -- at least using this technique. Unlike the btree or hash support, container hints can occur with any record, even a record that is a btree, bpage, htree, etc.
In this sense they are orthogonal to the traditional container techniques. In fact some of these cases make good sense, for example I have objects that have a one-to-one relationship with a btree. The use of containers means that I could have both my application object and the BTree in the same physical record. Another approach to containers would be to manage which physical records fit on the same page (8K block). If we had a reverse lookup from physical records to translation pages, then we could rearrange physical records, even ones that pre-exist, so that co-materialization hints were respected. That is not a good match with the current design since, to the best of my knowledge, pages are not "cached" in any sense and only the fetched record on a page is actually deserialized. -bryan

-----Original Message-----
From: jdb...@li...
To: 'JDBM Developer listserv'
Sent: 9/29/2005 6:35 PM
Subject: re[6]: [Jdbm-developer] logical containers and translation pages (was More reverse lookup info)

Bryan-

I think that your container belongs at the same level as BTree and HTree - coupling it directly to the page manager would make our lives more difficult down the road when we start implementing pack & rebuild and page level optimizations (NIO, etc...). Is that what you mean by "native jdbm feature"? Just wanted to confirm. At some point in the future if you start looking at alternative designs for keeping the pages balanced, you may want to take a look at how BTree handles balancing - the data structures are so similar that there may be some hints in there that could be useful. Cheers, and thanks for helping me understand how these things work. - Kevin

> Kevin, I've been working my way towards that. With the current translation page scheme I think that I know how to take my existing code (which implements the RecordManager interface and delegates to the Cache/BaseRecordManager) and migrate it to be a native jdbm feature. This is definitely the way to go.
However, I am not sure how to do this if we wind up changing the jdbm translation pages. With respect to secondary containers, the rule that I use is that objects are inserted into a secondary container once the primary container is full (there are no more address bits available). This is not at the "page" level, but at the "record" level. An insert operation inserts into the primary container if there are free addressable slots and otherwise into the tail of the linked list of secondary containers. I then rotate any secondary container which is no longer full (through a delete of some record in that container) to the tail. This management strategy appears to work well.

Your point concerning correlation is correct. The "container" is a hint to the persistence layer, not a guarantee that one object winds up in the same record as another. I have not explored this, but you could clearly move logical records around within containers based on historical evidence of correlated access paths. Such an approach could doubtless be developed to place objects within containers dynamically based on historical paths. -bryan

-----Original Message-----
From: jdb...@li...
To: 'JDBM Developer listserv'
Sent: 9/29/2005 12:46 PM
Subject: re[4]: [Jdbm-developer] logical containers and translation pages (was More reverse lookup info)

Bryan-

Fantastic. I would love to see this added as another container type to jdbm (BTree, HTree and logical container). I have to admit that the term "logical container" is not particularly descriptive... Is there a more formal name for this data structure? Also, as an implementation detail, do you wind up splitting pages when requests are made to store too many "related" objects? Or do you just add another page?
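[Editorial sketch of Bryan's insert/rotate rule described earlier in this message - inserts go to the primary container while it has free addressable slots, otherwise to the tail of the linked list of secondary containers, and a secondary that frees a slot through a delete rotates to the tail. All names below are illustrative, not actual jdbm classes.]

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch (not jdbm code) of the secondary-container rule:
// insert into the primary container while it has free addressable slots,
// otherwise into the tail of the linked list of secondary containers; a
// secondary container that frees a slot through a delete is rotated to
// the tail so that it is the next one refilled.
class ContainerChain {
    static final int CAPACITY = 1 << 10;   // addressable slots per container

    static class Container {
        int used;                          // occupied slots
        boolean isFull() { return used >= CAPACITY; }
    }

    final Container primary = new Container();
    final Deque<Container> secondaries = new ArrayDeque<>();

    /** Insert one record, returning the container that received it. */
    Container insert() {
        Container target;
        if (!primary.isFull()) {
            target = primary;
        } else {
            Container tail = secondaries.peekLast();
            if (tail == null || tail.isFull()) {
                tail = new Container();    // chain grows at the tail
                secondaries.addLast(tail);
            }
            target = tail;
        }
        target.used++;
        return target;
    }

    /** A delete frees a slot; rotate a no-longer-full secondary to the tail. */
    void delete(Container c) {
        c.used--;
        if (c != primary && secondaries.remove(c)) {
            secondaries.addLast(c);
        }
    }
}
```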
I'm asking because it seems that if I have object A in the store, then I add object B and say that its usage is correlated to A but there isn't enough space, and that object gets placed on a new page, then I add object C and say that its usage is correlated to A, C will be placed on the same page as B, but neither will be in proximity to A. I guess the point is that B and C may not be closely correlated at all, even though they are both correlated to A. Just a thought... - K

> Kevin, I believe that your analysis is essentially correct. I will note that using the DumpUtility I have found that the "container" objects tend to fit on one page, e.g., a useful container is still smaller than 8k in practice (for our application). This is basically a scheme for object locality. All objects in the container are deserialized at once (like BPage). The physical records are cached. Once the container has been fetched, a subsequent request for any object in that container results in a cache hit on the container and then a binary search on the keys to find the already deserialized object. In all, this has proven to be extremely effective for our application. We see a 100% performance boost. -bryan

-----Original Message-----
From: jdb...@li...
To: JDBM Developer listserv
Sent: 9/29/2005 1:23 AM
Subject: re[2]: [Jdbm-developer] logical containers and translation pages (was More reverse lookup info)

Bryan-

Thanks for the description. I'm still coming off a wicked cold, so my brain is a bit fuzzy, so bear with me. As with most things, concrete examples help, so I'm going to try to put together a quick trace to help me understand:

logical row id = [12345][67] <-- I'm using the [][] notation to reflect values that are recovered by bit mask and shift operations - I don't want to have to think too hard here, so let's just say we have an arbitrary logical row id that allows us to yank 2 values from it.
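[Editorial sketch to make the [][] notation concrete: the mask-and-shift recovery of the two fields, plus the binary search over container keys that Bryan described. None of these names are actual jdbm API.]

```java
import java.util.Arrays;

// Illustrative sketch, not jdbm API: recover the two fields of a container
// recid by shift and mask, then find the slot's object with a binary search
// over the container's sorted key array.
class ContainerRecid {
    static final int SLOT_BITS = 10;                 // the 10 free bits
    static final long SLOT_MASK = (1L << SLOT_BITS) - 1;

    /** Pack a container recid and an in-container slot into one recid. */
    static long pack(long containerRecid, int slot) {
        return (containerRecid << SLOT_BITS) | (slot & SLOT_MASK);
    }

    /** The [12345] part: the recid of the container's physical record. */
    static long containerRecid(long recid) {
        return recid >>> SLOT_BITS;
    }

    /** The [67] part: the slot index within the container. */
    static int slot(long recid) {
        return (int) (recid & SLOT_MASK);
    }

    /** Binary search the sorted slot keys; values holds the deserialized objects. */
    static Object fetch(long[] keys, Object[] values, long recid) {
        int idx = Arrays.binarySearch(keys, slot(recid));
        return idx >= 0 ? values[idx] : null;        // null if the slot is unused
    }
}
```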
You use [12345] to locate the first page of the container. This page basically contains an array of longs that act as page/offset values into the logical container. Are you then looking for the value 67 in the keys using a binary search, then taking that index and mapping to the object array?

OK - given that my understanding of the situation is as described above, here are my comments:

1. This approach requires that lots of objects be deserialized at the same time. This may or may not be a good thing, depending on the application's needs (more on that in a second - I think there may be some things about how the jdbm layers interact that haven't been made as clear as they should).

2. If you happen to have an application that needs a data structure that meets the following conditions, then this is a good data structure:

a) Keys are longs (or some other primitive number - strings or more complicated objects that require custom comparators are out of the question), and you don't need to be able to specify the values of the keys explicitly at the application level (i.e. you couldn't key off of some value that was actually a long - such as a datestamp or anything like that).

b) There is a strong implicit, but not necessarily well defined, relationship between groups of stored objects.

c) You don't have to store very many "related" objects, or the objects are really, really small. In the current implementation, if a large number of objects are being stored, then looking up a single value will be very expensive in terms of disk reads. In the example you give (with the object array sized at 2^10), and assuming a (very) modest serialized data size per object of 32 bytes, you would wind up spanning 5 or 6 pages. That's a lot of disk reading, and a lot of deserialization, just to retrieve one object.

3.
Now that I'm thinking about it, I'm pretty sure that your solution here is a degenerate BTree - basically, this logical container is a set of BPage leaves, and your addressing scheme is used to allow direct access to the page of interest without walking the tree. If you look at the BPage code, and remove all data members that are non-leaf related, I think you will find that the data structures are identical. Is there anything else here that I might be missing?

So, to summarize, the real advantage of this data structure is that it allows you to say "Store Object Y, and by the way, most of the time I'm going to use it, I'm going to also be using Object X that I've already stored". This is certainly very, very useful in some situations (although none come to mind at the moment - but that's more a reflection of the type of work I do...), and removes the requirement of figuring out how to allocate keys so they clump related objects together. One thing I am unclear on, though, is how you handle the situation where I insert an object and say it is related to another object that is on a full page. Do you just not worry about it at that point? In a BTree, the tree would actually be adjusted to provide proximity... There's certainly no reason that you couldn't have a similar mechanism, but I think it would be very inefficient compared to the BTree approach of splitting pages and balancing the tree.

OK, enough with the comments, now for a quick discussion about jdbm IO... In your description of the data structure, you said that one advantage of this is fewer (but larger) IO transfers. This is actually not the case. One thing that hasn't been talked about a lot in all of our discussions about how jdbm works is the way IO occurs. I'm extrapolating a bit from your comment, so if I get it wrong, please gently correct me... When you read a record from the record manager, jdbm does not go to that offset on the disk and read the bytes for that record.
jdbm *always* performs its disk IO one page at a time (2^13 bytes). If you are reading a record that starts on byte 2000, then it reads the entire page, then does the offset into the page and yanks the bytes out for deserialization. This is the reason that I've been puzzled about your request for adding pre-fetch to the system. For all intents and purposes, there *is* a pre-fetch mechanism, or as close as you can come to one in a general purpose, random access IO system. It would be possible to do a little bit better here in terms of allowing the developer to specify a set of logical row ids to fetch, then do a gathering read (there's that NIO again!) to pull any pages that aren't in cache already.

This has a couple of implications that may have a big impact on the perceived benefits of the logical container data structure over storing individual records (this is not to say that there aren't advantages for certain data sets, but I think that the following will remove any concern you have about efficient disk access): If you have a lot of small objects stored on a page (just to be consistent with my meanderings above, let's say that we have 32 byte objects, which will give you about 256 objects per page), and you request the 90th object on the page, the entire page is read into memory, the bytes for just the 90th object are deserialized, and the object is returned.

This is where the object cache comes into play. Any future request for that object will return it from the cache, so you never have to deserialize that object unless it is actually evicted from the cache (both L1 and L2). Another point of interest is that there is a page cache. This means that if you then turn around and read the 95th object, all that is involved is a deserialization operation - no disk IO is required, because the page is already in the page cache. This, of course, makes the assumption that you actually *want* to read object 90 and 95 in close proximity to each other.
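[Editorial sketch of that page-at-a-time read path - illustrative only, not the actual jdbm implementation, and it ignores records that span a page boundary.]

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch, not the actual jdbm code: every read pulls the whole
// 8K page containing the record, then slices the record's bytes out at its
// offset within that page. (Records spanning a page boundary are ignored
// here for simplicity.)
class PageIO {
    static final int PAGE_SIZE = 1 << 13;  // 2^13 = 8192 bytes

    static byte[] readRecord(RandomAccessFile file, long byteOffset, int length)
            throws IOException {
        long pageNo = byteOffset / PAGE_SIZE;        // page holding the record
        byte[] page = new byte[PAGE_SIZE];
        file.seek(pageNo * PAGE_SIZE);
        file.readFully(page);                        // whole-page read
        int offsetInPage = (int) (byteOffset % PAGE_SIZE);
        byte[] record = new byte[length];
        System.arraycopy(page, offsetInPage, record, 0, length);
        return record;                               // bytes ready to deserialize
    }
}
```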
(as a side note - the size of the page cache is actually a high-watermark, which is one of the reasons that it was important to commit() even if you had transactions turned off - even after you commit() the transaction, the total number of pages used in that transaction will continue to be in memory).

Some other comments: I'm actually not at all concerned about the "wasted" space in the current logical record id. The amount of addressable space in the file shrinks as you add more logical row ids (i.e. the translation table grows). This means that you'd have to have a file filled with 8 byte objects in order to run out of logical row ids before you run out of addressable file space (to say nothing of the fact that you'd have run out of physical drive space eons before that - as a fun exercise, calculate how long it would take to write 2^63 bytes at modern disk bus speeds... even at 100 MB/s it works out to nearly three thousand years, so I think you get the picture :-) ). I was actually quite happy that the wasted space was there, because it meant I could be tricky and shrink the reverse lookup value in the record header so we don't clobber disk space in order to have reverse lookup.

Enough for now - lots of stuff to think about. - K

> Kevin, A while back I posted on the notion of "logical containers." I don't believe that I ever cleared up the terminology to your satisfaction. However, the recent thread on the translation pages has given me some insight as to how to use such containers so as to consume those 10 remaining bits in the recid in the current translation page scheme. I'll sketch out the approach that I am suggesting.
It has the advantage that it packs many logical records (from the application perspective) inside of a single physical record (from the jdbm perspective).

<snip>

-------------------------------------------------------
This SF.Net email is sponsored by: Power Architecture Resource Center: Free content, downloads, discussions, and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
Jdbm-developer mailing list
Jdb...@li...
https://lists.sourceforge.net/lists/listinfo/jdbm-developer