From: Thompson, Bryan B. <BRYAN.B.THOMPSON@saic.com> - 2005-10-31 15:00:24
I have some interesting results to report when benchmarking our
application on jdbm. Our benchmark is essentially an insert test,
but it performs a variety of queries against data during the insert
so the locality of object reference is relatively large. The final
store size is 49M, so everything could run in memory during the
transaction (the test is a single transaction).
The baseline for my efforts used the "soft" cache with an internal MRU
of 1000. I also implemented a "CacheAll" option for benchmarking
purposes. It does not evict anything from the cache. Given the "lazy
insert" policy, this means that we defer all allocation of physical rows
and serialization until the commit, at which point we traverse the cache
and log everything.
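To make that concrete, the "CacheAll" + "lazy insert" combination
amounts to something like the following. The class and method names
here are just illustrative, not the actual jdbm interfaces:

    import java.util.LinkedHashMap;
    import java.util.Map;

    class CacheAllPolicy {

        // Never evicts: every object touched during the transaction stays
        // here, keyed by its logical record id.
        private final Map<Long, Object> cache = new LinkedHashMap<Long, Object>();

        private long nextRecid = 1;

        // "Lazy insert": hand out a logical recid immediately, but defer
        // the physical row allocation and serialization until commit.
        public long insert(Object obj) {
            long recid = nextRecid++;
            cache.put(recid, obj);
            return recid;
        }

        public Object fetch(long recid) {
            return cache.get(recid);
        }

        // At commit we traverse the cache once, serializing and logging
        // everything in a single pass.
        public void commit(RowWriter writer) throws java.io.IOException {
            for (Map.Entry<Long, Object> e : cache.entrySet()) {
                byte[] data = serialize(e.getValue());
                writer.write(e.getKey(), data); // allocate physical row + log
            }
        }

        private byte[] serialize(Object obj) throws java.io.IOException {
            java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
            java.io.ObjectOutputStream oos = new java.io.ObjectOutputStream(baos);
            oos.writeObject(obj);
            oos.close();
            return baos.toByteArray();
        }

        // Stand-in for the layer that allocates a physical row and appends
        // the serialized record to the transaction log.
        interface RowWriter {
            void write(long recid, byte[] data) throws java.io.IOException;
        }
    }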
1. The "lazy insert" change results in a 13% performance gain (measured
with the soft cache).
2. If I use record compression (all records) and play some games to
increase the record size (by combining some individual records within
a compound object), then I can get the store down to 39% of the space
for 117% of the time. I think that exploring record compression
further might pay off; more on this later. (Again, this was using
the "soft" cache. A sketch of the kind of compression wrapper involved
follows this list.)
3. The "CacheAll" policy reduced the runtime until the start of the
commit by 50%! The total runtime was effectively unchanged.
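The compression experiment in (2) boils down to a wrapper of roughly
this shape around the serialized record bytes (illustrative only; the
real hook would be the record serializer). Compressing larger compound
records is what makes the space savings worthwhile:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    class RecordCompression {

        // Compress a serialized record before it is written to its
        // physical row.
        static byte[] compress(byte[] raw) {
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            deflater.end();
            return out.toByteArray();
        }

        // Expand the record again when it is read back.
        static byte[] decompress(byte[] packed) throws IOException {
            Inflater inflater = new Inflater();
            inflater.setInput(packed);
            ByteArrayOutputStream out = new ByteArrayOutputStream(packed.length * 2);
            byte[] buf = new byte[1024];
            try {
                while (!inflater.finished()) {
                    int n = inflater.inflate(buf);
                    if (n == 0 && inflater.needsInput()) {
                        throw new IOException("truncated compressed record");
                    }
                    out.write(buf, 0, n);
                }
            } catch (DataFormatException ex) {
                throw new IOException("bad compressed record: " + ex);
            }
            inflater.end();
            return out.toByteArray();
        }
    }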
When using the "soft" cache the high tide (maximum #of objects in the
cache at any one time) for the cache was 100k objects, but there were
300k persistent objects created during execution. This means that
2/3rds of the records were allocated physical rows and serialized before
the commit (in response to cache evictions).
Based on this, I believe that we could realize a very significant
performance gain in jdbm by optimizing the process by which objects
leave the cache and are laid down on disk during a transaction. I am
running these tests on my laptop, but it can write 50M (6396 blocks)
in under 4 seconds using RandomAccessFile. jdbm is spending nearly
30 seconds to allocate the necessary physical rows, serialize the
objects, and write that much data.
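For reference, the raw write figure comes from a simple loop along
these lines (6396 blocks for ~50M works out to 8k blocks; this is just
the shape of the measurement, not the exact harness):

    import java.io.RandomAccessFile;

    public class RawWriteTest {

        public static void main(String[] args) throws Exception {
            final int BLOCK_SIZE = 8192;
            final int NBLOCKS = 6396;
            byte[] block = new byte[BLOCK_SIZE];

            RandomAccessFile raf = new RandomAccessFile("rawwrite.tmp", "rw");
            long begin = System.currentTimeMillis();
            for (int i = 0; i < NBLOCKS; i++) {
                raf.write(block);
            }
            raf.getFD().sync(); // force to disk before stopping the clock
            long elapsed = System.currentTimeMillis() - begin;
            raf.close();

            System.out.println("wrote " + (NBLOCKS * BLOCK_SIZE)
                    + " bytes in " + elapsed + "ms");
        }
    }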
When using the "soft" cache:
serialization time: 93ms
deserialization time: 1184ms
When using the "CacheAll" policy:
serialization time: 153ms
deserialization time: 0ms
From this it is clear that serialization costs do not dominate cache
eviction and the commit. At the same time it is clear that raw block
I/O is not the dominating factor either. The bulk of these costs are
therefore related to the allocation of physical rows and to the manner
in which the dirty pages are being migrated to disk during the commit.
I do not have a design to propose, but I am thinking in terms of:
- accumulating evicted records until we have several pages worth before
allocating their physical rows. (records could be serialized on
eviction so that we know the required allocation size. see the sketch
after this list for what I have in mind.)
- clustering objects onto pages during eviction.
- re-implementing RecordFile and TransactionManager w/ DBCache (rather
than optimizing the existing transaction manager).
- exploring the tradeoff point between in-place updates of records on
pages and re-allocation of records on dirty pages (an extreme version
of merging free records in which we re-allocate all records on a page
which has undergone updates which do not fit in place, so that we can
take advantage of a batch-oriented physical row allocation and write).
- evaluating record compression options further. e.g., by aliasing more
than one record into the same physical row (using some bits in the
translation page slot to dealias the records in the physical row when
it is unpacked).
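To make the first of these ideas a bit more concrete, here is a rough
sketch of an eviction buffer that batches serialized records into
page-sized groups before any physical rows are allocated. All of the
names (EvictionBuffer, PageAllocator) are made up:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    class EvictionBuffer {

        private static final int PAGE_SIZE = 8192;
        private static final int TARGET_BYTES = 4 * PAGE_SIZE; // several pages worth

        // Records are serialized on eviction so that we know their
        // allocation size, but no physical row is assigned yet.
        private final List<PendingRecord> pending = new ArrayList<PendingRecord>();
        private int pendingBytes = 0;

        public void evicted(long recid, byte[] serialized, PageAllocator allocator)
                throws IOException {
            pending.add(new PendingRecord(recid, serialized));
            pendingBytes += serialized.length;
            if (pendingBytes >= TARGET_BYTES) {
                flush(allocator);
            }
        }

        // Allocate physical rows for the whole batch at once, clustering
        // records that were evicted together onto the same pages.
        public void flush(PageAllocator allocator) throws IOException {
            if (pending.isEmpty()) {
                return;
            }
            allocator.allocateAndWrite(pending);
            pending.clear();
            pendingBytes = 0;
        }

        static class PendingRecord {
            final long recid;
            final byte[] data;
            PendingRecord(long recid, byte[] data) {
                this.recid = recid;
                this.data = data;
            }
        }

        // Stand-in for the layer that would do the batch-oriented physical
        // row allocation and lay the records down page by page.
        interface PageAllocator {
            void allocateAndWrite(List<PendingRecord> batch) throws IOException;
        }
    }

The same buffer is also the natural place to cluster related objects
onto the same page before they are written out.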