From: Thompson, B. B. <BRY...@sa...> - 2006-04-11 18:46:24
The hotspot that I described is physical row allocation. Insert and fetch both drive that, since they force objects to be evicted from the cache, but they are not otherwise a hotspot. I think that the design I am proposing could address this hotspot by creating free space between the data and slot maps. I am sure that there are other ways to solve the problem as well. However, we have been primarily benchmarking an insert-intensive scenario which computes the closure of a knowledge base as it loads statements into the KB.

Read optimization is definitely going to be an issue for the application, and the ability to read objects directly from their OID could be a huge win, as could clustering of OIDs and/or objects. Basically, we need set-oriented read operations for joins and the like - and that requires clustered data to be successful. If you are doing a broad scan of statements in the KB, e.g., all "Persons", then you might actually see the full 2:1 performance benefit for a large KB since, in that situation, you would be reading a large #of statements without revisiting any and you could burn right through the cache. I think that this is really where this becomes an issue - when a read operation is accessing large swaths of persistent data in access patterns where the cache gives you limited benefit, e.g., when you need to read 10x or 100x more pages from disk than will fit into your cache. We could keep a separate cache for translation pages, but you see my point.

Basically, I would like to explore various solutions here and see what works. As you indicated, I would like to see all of the components isolated enough so that we can plug and play different strategies and measure them against some applications of interest.

-bryan

_____

From: jdb...@li... [mailto:jdb...@li...]
On Behalf Of Kevin Day
Sent: Tuesday, April 11, 2006 2:06 PM
To: JDBM Developer listserv
Subject: re[2]: [Jdbm-developer] record management

Bryan-

The numbers you provide are definitely instructive... If you attach a profiler to your application, what percentage of the processing time is being spent in lookup? I suspect that inserts are where the majority of the performance hit is coming from, so I don't want to get too far into talking about ways of tweaking the last bit of performance out of reads, etc...

My general opinion here is that if we want OIDs to have locality, we do it by providing batch insert operations on the record manager (i.e., the application could call insert(Object[]) and get back a list of contiguous OIDs - the translation page allocator would then do whatever it had to in order to get contiguous blocks). I'm not sure that a complete re-think of the dedicated translation page approach is necessary to achieve this goal. Once you have OIDs all on a page, you can have the record allocator move records around as the application sees fit to obtain locality. Reading all of the objects with 'locality' to each other would then involve two page reads. I think that doing a lot of work to reduce 2 page reads to 1 page read may not really gain us much in terms of overall system improvement...

The good news is that the proposed MVCC architecture should be completely independent of the paging system, so we will be free to play with different choices if we see fit. I do want to explore some of the potential hotspots in the proof of concept and see if you have any thoughts on object to page cache interaction - I'll write that in a separate email.

- K

> Kevin, Comments below. -bryan

________________________________________

From: jdb...@li... <mailto:jdb...@li...> [mailto:jdb...@li...]
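Kevin's batch-insert suggestion could look roughly like this. This is a hedged sketch: `BatchAllocator`, its in-memory OID counter, and the method name `insertBatch` are hypothetical simplifications for illustration, not jdbm's actual RecordManager API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a batch insert that hands back contiguous OIDs,
// so that records inserted together share translation-page locality.
class BatchAllocator {
    private long nextOid = 1;  // next free logical id (simplified; real
                               // allocation would consult translation pages)

    List<Long> insertBatch(Object[] records) {
        List<Long> oids = new ArrayList<>(records.length);
        long first = nextOid;
        nextOid += records.length;      // one contiguous reservation
        for (int i = 0; i < records.length; i++) {
            oids.add(first + i);        // serialization + store would go here
        }
        return oids;
    }
}
```

The point of the sketch is only the contiguous reservation: one call, one run of OIDs, which is what lets the allocator place all of the translation slots on a single page.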
<mailto:jdb...@li...> On Behalf Of Kevin Day
Sent: Monday, April 10, 2006 9:36 PM
To: JDBM Developer listserv
Subject: re: [Jdbm-developer] record management

Bryan-

Interesting - this is similar to something that I saw in the (cricket?) architecture for a threaded log based db. A couple of questions/comments that may spark some discussion points or new directions:

1. It seems that the proposed idea would work well until records are updated in a manner that increases their size. At that point, it seems unlikely to me that all of the records on a given page would remain on that page, so you'd have some entries that would overflow into other pages.

BBT: Yes, the notional design is that each page has a slot map and a data space. The slot map records the offset of the physical record on the page. If there is an overflow situation, then the slot map operates as a translation slot and records the _slot_ of the record on a _different_ page. At most one indirection would be performed. If all slots on a page are indirected, then this is the same as the current translation page solution.

2. I'm pretty sure that this would preclude the ability of ever packing and shrinking the database... If I allocate tons of objects, then delete all but one on each page, I can't shrink the database or otherwise optimize it in any way. I have been envisioning a system that always places translation pages at the front of the db file, even if it means moving data pages out of the way during an allocation. In that way, the db can always be shrunk down to the size of the number of translation pages that contain active slots.

BBT: This design could degenerate into the translation page design (per above), but it is oriented towards reducing lookup cost for objects rather than compaction of the store.

3. This design does allow for some form of space optimization without incurring the cost of updating the translation table itself - which definitely has benefits.

4. It seems that this would reduce the efficiency of the translation page cache considerably - instead of having ~2000 slots on a page, we'd probably wind up with ~20-50 (obviously, it depends on the size of the data record). In the multi-version scenario, if a version table has to be written to disk at all (and in the vast majority of cases it will not), this *might* force a larger number of page writes to the safe than is strictly necessary. Given the non-locality of recids, this may or may not actually be a consideration, though.

BBT: I do not see how it could reduce the efficiency of the translation page caching. I think that the better analogy is that translation pages are an overhead that we can avoid in many scenarios, but that the design can degrade to the same performance as the translation page design. Also, I am very much thinking in terms of clustering support with this design.

5. One way of addressing my concern with #2 (which is along the lines of the cricket architecture) is to use a b-tree to manage the offsets of the pages themselves. Each node in the BTree would represent the starting point of a contiguous block of pages - so if you have 50 pages on disk that are ordered by their logical ordering, it would require a single BTree node. This allows pages to be moved to arbitrary locations (and, in fact, sets you up for infinite stores - you just have occasional checkpoints that include an entire BTree). That, of course, re-introduces a lookup...

6. Locality takes a bit of a hit in terms of what is easy to implement vs. not - if an application wants to request that the recman place a set of objects near each other, it can only do so (efficiently) in the context of the initial inserts of the objects (and only then if the allocator assigns recids that are near each other).

BBT: Yes, this is an interesting issue. It seems that the more information available to drive a clustering strategy the better.
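Point 5 above can be sketched in a few lines using an in-memory TreeMap as a stand-in for the proposed on-disk BTree. Each entry describes one contiguous run of pages; the class and method names are illustrative, not from jdbm.

```java
import java.util.Map;
import java.util.TreeMap;

// One entry per contiguous run of pages:
//   key   = first logical page number of the run
//   value = physical page number where that run starts on disk
class PageExtentMap {
    private final TreeMap<Long, Long> runs = new TreeMap<>();

    void addRun(long firstLogicalPage, long firstPhysicalPage) {
        runs.put(firstLogicalPage, firstPhysicalPage);
    }

    // Translate a logical page to its physical location: find the run
    // that starts at or before it, then add the distance into the run.
    long physicalOf(long logicalPage) {
        Map.Entry<Long, Long> run = runs.floorEntry(logicalPage);
        return run.getValue() + (logicalPage - run.getKey());
    }
}
```

With 50 logically ordered pages in one run, the whole run costs a single entry, which is the compactness Kevin describes; the price is the extra lookup he notes at the end of the point.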
When an object is first created, the most information that you might expect to have on hand is some notion of a "container". If you are able to defer until the object has been linked, then you can leverage the links and metadata about link classes to make clustering decisions. I see several ways of handling this: (1) use the translation page mode of the design and defer the placement of the physical records onto pages until more information is available. This would support clustering of physical records, but not of the OIDs. The existing jdbm defers serialization and writing onto the page image in an appropriate manner but does not use a clustering strategy to guide the choice of pages onto which to place the physical records. (2) use a policy in the application framework that supports persistence by reachability such that the OID assignment may be deferred. In this case, the OIDs could be clustered as well. (3) use the notion of a "container" for clustering OIDs. Objects inserted into a container will be clustered together.

BBT: This highlights a possible problem with the use of translation pages. If the OIDs are not clustered, then a large store may demonstrate thrashing in the translation pages since the OIDs themselves do not demonstrate locality of reference. Clustering of persistent objects at insert appears to be required to minimize disk reads, since otherwise reads of translation pages may be scattered. Here are some sample numbers:

#of objects in store: 10M
average size of serialized object: 80 bytes
per object overhead: 8 bytes (record header) vs 11 bytes (slot size)
adjusted average object size: 88 vs 91 bytes
#of translation pages (jdbm 1.x): 12,237
average #of objects per data page: 93 vs 90
#of data pages: 107,422 vs 111,084
total #of pages: 119,659 vs 111,084

Now I am not convinced that the translation page design is bad. It does introduce an indirection, but if there is reasonable locality in the OIDs then the translation pages will be cached.
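The data-page counts in the sample numbers can be reproduced with simple ceiling arithmetic, assuming an 8192-byte page and dense packing of records across pages. The 8192-byte figure is an assumption about jdbm's block size; the translation-page count of 12,237 depends on per-slot details not spelled out in the thread, so it is not recomputed here.

```java
// Back-of-envelope reproduction of the page counts quoted above.
class PageCountEstimate {
    static final long PAGE_SIZE = 8192;  // assumed jdbm block size

    // Pages needed for n objects at the adjusted (payload + overhead)
    // size, assuming records pack densely across the data pages.
    static long dataPages(long nObjects, long adjustedSize) {
        long totalBytes = nObjects * adjustedSize;
        return (totalBytes + PAGE_SIZE - 1) / PAGE_SIZE;  // ceiling division
    }
}
```

For 10M objects this gives 107,422 data pages at 88 bytes and 111,084 at 91 bytes, matching the figures quoted; the jdbm 1.x total of 119,659 is then just 107,422 data pages plus the 12,237 translation pages.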
If we do not have locality in the OIDs, then the new design will still have to do a lot of page reads for the physical records since they will not be effectively clustered. The translation table design appears to be marginally more space efficient, but this is entirely a function of the difference in the per-object overhead (3 bytes), so little differences get magnified by the #of objects.

Kevin wrote: Of all of the above, the only one that I'm actually really concerned about is #2 - the rest can be addressed with fancy bookkeeping (a pain in the neck to develop, but it should be doable). I suspect that #5 is not going to be palatable, because it kind of defeats the purpose of doing this in the first place.

BBT: I admit that I am less concerned with compaction of the store and more concerned with read performance and with the overhead associated with managing free space in the store. This is still the largest hotspot in jdbm 1.x and takes 10% of the runtime for our application. I think that the new design will provide faster management of free space on pages. The basic principle is to divide each page into a slot map (allocated like a stack - top down) and a data space (growing bottom up). The data space would be dense. The free space would exist between the end of the data space and the bottom of the slot map. Operations on the page would be atomic (thread safe). Write operations could reorder the data space and the slot map. Write operations could also force a physical record to become indirected (translated to another page) or to become direct (on page).

-bryan

- K

> All, I have been re-thinking record management some. I want to share some of my thoughts here. I also want to see whether the version management requirements for Kevin's MVCC module could be satisfied by such a design. The main insight is that the existing translation page design could be rethought as one possible extreme in a more flexible, and hopefully more efficient, design.
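Bryan's free-space principle (dense data space growing up, slot map growing down, free space in between) can be sketched as follows. The 11-byte slot size is taken from the sample numbers earlier in the thread; everything else about the layout, including the class name and the absence of a page header, is a guess for illustration.

```java
// Minimal sketch of the proposed page layout: data grows up from the
// bottom, the slot map grows down from the top, and the free space is
// whatever remains between them.
class SlottedPage {
    static final int PAGE_SIZE = 8192;  // assumed block size
    static final int SLOT_SIZE = 11;    // per-slot overhead quoted above

    private int dataEnd = 0;    // first free byte above the dense data space
    private int slotCount = 0;  // slots allocated from the top down

    int freeSpace() {
        return PAGE_SIZE - dataEnd - slotCount * SLOT_SIZE;
    }

    // Allocate a record on this page; returns the new slot index, or -1
    // if the record plus its slot will not fit in the free space.
    int allocate(int recordLength) {
        if (recordLength + SLOT_SIZE > freeSpace()) return -1;
        dataEnd += recordLength;   // data space stays dense
        return slotCount++;        // slot map stays dense too
    }
}
```

Because both regions stay dense, the free-space check is a single subtraction rather than a free-list walk, which is the faster free-space management Bryan is after.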
The basic change is that each page would incorporate the capability to either store logical-to-physical translation slots or directly encode the physical record. The same slot map would be reused for both purposes, with appropriate coding to indicate an on-page record vs. a record that was indirected to another page. Setting aside the page header for the moment, the design would have records growing up from the bottom of the page (lower addresses) and the slot map growing down from the top of the page. The lower bits of the OID would address slots in the page. Those slots would either give an on-page offset or the pageId and slot of the record on another page. The data space and the slot map space on the page would be kept dense, so there would never be free space between records. This should be a consistent data structure, and reconciliation of page images should be possible.

This design allows us to directly look up records by the OID without indirection. In the best case this means that we do one fetch for an object vs. two. If the record is on the page, then we are done. Otherwise we indirect, and the effect is exactly as with the existing translation pages. In the extreme, if all records are indirected, then the design is essentially the same as the existing translation page design. I am still playing around with record management rules for the page; there are clearly many designs that are possible, and clustering remains an important issue. Got to run.

-bryan
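The fetch path described above can be sketched like this. The slot encoding, the 7-bit slot field, and all of the names are hypothetical choices made for the sketch, not part of the actual proposal.

```java
// A slot either holds an on-page offset (direct) or points at a slot on
// another page (indirected); at most one indirection is ever followed.
class Slot {
    boolean direct;   // true: record is on this page at 'offset'
    int offset;       // valid when direct
    long otherPage;   // valid when indirected
    int otherSlot;    // valid when indirected
}

class OidLookup {
    static final int SLOTS_PER_PAGE_BITS = 7;  // illustrative slot-field width

    // The low bits of the OID select the slot; the high bits select the page.
    static long pageOf(long oid) { return oid >>> SLOTS_PER_PAGE_BITS; }
    static int  slotOf(long oid) {
        return (int) (oid & ((1 << SLOTS_PER_PAGE_BITS) - 1));
    }

    // Page reads a fetch costs under this design: 1 in the direct (best)
    // case, 2 when the slot indirects, exactly as with translation pages.
    static int pageReads(Slot s) { return s.direct ? 1 : 2; }
}
```

In the degenerate case where every slot on a page is indirected, every fetch costs 2 reads and the page behaves exactly like a translation page, which is the equivalence claimed at the end of the proposal.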
_______________________________________________
Jdbm-developer mailing list
Jdb...@li...
https://lists.sourceforge.net/lists/listinfo/jdbm-developer