From: Thompson, B. B. <BRY...@sa...> - 2005-10-03 20:50:43
Hello, I have been working on an extensible serialization scheme for jdbm. What follows is a description of the approach and of the changes required to integrate it into the jdbm code base. At this point the basic framework works ;-) all test suites pass (-; and I am seeking feedback on the approach before committing the changes or finalizing the approach.

The basic approach is that an [int] classId is used to identify which class was serialized, and the appropriate Serializer is looked up based on that classId. Two high bits are reserved in the classId. One of these bits indicates that a [byte] versionId follows; otherwise the versionId is zero. The other bit indicates that the recid of a Serializer follows, in which case that serializer is fetched and applied to deserialize the object. The mapping from classId to class name and the mapping (for each classId) from versionId to the serializer for that versionId are maintained in hash maps, which constitute the state of the extensible serializer.

You can explicitly register classes and their serializers against the extensible serializer. New classes are transparently registered during serialization and use default Java serialization if they implement Serializable or Externalizable; otherwise it is a serialization error. When new classes or serializers are registered, the state of the ExtensibleSerializer is updated against the store.

The overhead is an [int], plus a [byte] if you use the versionId, plus a [long] if you have an explicit serializerId, for a normal overhead of 4 bytes and a maximum overhead of 13 bytes. If you were willing to live with a limit of 2^14 (16,384) distinct classes, then you could bring those numbers down by 2 bytes by using a [short] to store the classId. I could also see moving to a [short] for the versionId.

Serializers are applied at two logical levels within jdbm. The first is the physical record, whose (de-)serialization is handled by the BaseRecordManager.
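To make the proposed header layout concrete, here is a minimal sketch of the encoding. The class, constant, and method names are illustrative only, not taken from the actual patch:

```java
import java.io.*;

// Sketch of the proposed record header: an [int] classId whose two high
// bits flag an optional [byte] versionId and an optional [long] serializer
// recid. All names here are hypothetical.
public class SerialHeader {

    static final int HAS_VERSION    = 1 << 31; // a [byte] versionId follows
    static final int HAS_SERIALIZER = 1 << 30; // a [long] serializer recid follows
    static final int CLASS_ID_MASK  = ~(HAS_VERSION | HAS_SERIALIZER);

    /** Extract the classId from a decoded header word. */
    public static int classId(int header) {
        return header & CLASS_ID_MASK;
    }

    /** Encode the header: 4 bytes normally, up to 4 + 1 + 8 = 13 bytes. */
    public static byte[] encode(int classId, byte versionId, long serializerRecid)
            throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        int header = classId & CLASS_ID_MASK;
        if (versionId != 0) header |= HAS_VERSION;
        if (serializerRecid != 0L) header |= HAS_SERIALIZER;
        out.writeInt(header);
        if (versionId != 0) out.writeByte(versionId);
        if (serializerRecid != 0L) out.writeLong(serializerRecid);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(encode(42, (byte) 0, 0L).length);  // prints 4
        System.out.println(encode(42, (byte) 3, 17L).length); // prints 13
    }
}
```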
When caching is used, the update operation is deferred and the serializer to be applied during update is stored in the cache entry along with the reference to the object and its recid. When the entry is evicted from the cache, the serializer noted on the cache entry is applied to generate the serialized form for the physical record.

The second is for "compound records", such as the BPage class. BPage uses the optional key and value serializers specified on the owning BTree, and default Java serialization when the key or value serializer was not specified (null). When the compound record is heterogeneous, the use of the extensible serializer makes good sense. However, if you have homogeneous compound records then it makes sense to use a custom serializer rather than the extensible serializer, since you can realize a significant space savings. For example, if you have a BTree and all keys are one datatype (e.g., Long) and all values are another datatype (e.g., String), then using custom key and value serializers would save 4 bytes per key and value compared to the extensible serializer (and the use of versionId and serializerId information could interfere with key compression). A mixed model with homogeneous keys and heterogeneous values is of course possible; heterogeneous values are persisted using Java serialization.

The proposed extensible serialization approach has a variety of advantages. If you do not opt in, it has no impact on the binary format of the store. If you opt in, then you can optimize the serialization of your classes by configuring the extensible serializer. You can also use optimized serialization with heterogeneous objects, e.g., a BTree whose keys are of a homogeneous type and whose value types are heterogeneous. Any value that has a registered custom serializer will be read and written using that serializer.
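As a sketch of the space savings from a custom serializer for homogeneous keys: a Long key costs 8 bytes under a custom serializer, versus 8 plus the 4-byte classId under the extensible serializer. The Serializer interface below mirrors jdbm's current stateless API; the demo class name is illustrative:

```java
import java.io.*;

// Minimal stateless serializer interface, mirroring jdbm's current API.
interface Serializer {
    byte[] serialize(Object obj) throws IOException;
    Object deserialize(byte[] data) throws IOException;
}

// Custom key serializer for a BTree whose keys are all Long:
// exactly 8 bytes per key, no per-key classId overhead.
class LongSerializer implements Serializer {
    public byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream(8);
        new DataOutputStream(bos).writeLong((Long) obj);
        return bos.toByteArray();
    }
    public Object deserialize(byte[] data) throws IOException {
        return new DataInputStream(new ByteArrayInputStream(data)).readLong();
    }
}

public class LongSerializerDemo {
    public static void main(String[] args) throws IOException {
        Serializer s = new LongSerializer();
        byte[] b = s.serialize(12345L);
        System.out.println(b.length);         // prints 8
        System.out.println(s.deserialize(b)); // prints 12345
    }
}
```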
If you use the extensible serializer and do not specify a custom serializer for some value class, then it will use default serialization for now, but you can always add a versioned serializer for that class later and transparently migrate to a more efficient serialization technique on a class-by-class basis. By default all classes are unversioned, but you can modify the Class implementation at any time to specify a versionId (it is identified by reflection and then cached), at which point existing instances will be read using the serializer for the unversioned class and written using the serializer for the versionId currently marked on the class. New versions can be introduced over time, and old versions can always be read (as long as the appropriate Serializer is still available).

Using a new RecordManagerOption (SERIALIZER_CLASS) you can opt in to the extensible serializer by naming a singleton that proxies for the extensible serializer (ExtensibleSerializer.Singleton) as the value of the SERIALIZER_CLASS property. Code which specifies an explicit Serializer in recman API calls will continue to use that Serializer (so the RecordManager API semantics do not change), but code which uses the implicit serializer, e.g., insert( obj ), fetch( recid ), and update( recid, obj ), will use the configured default serializer. For backward compatibility the default value for this property is "jdbm.helper.DefaultSerializer". A new method on the RecordManager interface, getDefaultSerializer(), returns the configured serializer instance.

BTree has optional key and value serializers. When [null], these currently result in the use of Java serialization (writeObject is called on the key or value). When a serializer is specified, the key (or value) is serialized and then a [short] length is written followed by the serialized byte[].
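A minimal sketch of how opting in might look from the caller's side. The property key string and the singleton class name used here are assumptions for illustration; only the default value "jdbm.helper.DefaultSerializer" comes from the proposal:

```java
import java.util.Properties;

// Hypothetical opt-in sketch for the SERIALIZER_CLASS RecordManagerOption.
// The key string "jdbm.serializer.class" and the singleton class name are
// assumed for illustration and may differ in the final patch.
public class OptInDemo {
    static final String SERIALIZER_CLASS = "jdbm.serializer.class"; // assumed key

    public static void main(String[] args) {
        Properties props = new Properties();
        // Default: the backward-compatible Java serialization wrapper.
        String def = props.getProperty(SERIALIZER_CLASS, "jdbm.helper.DefaultSerializer");
        System.out.println(def); // prints jdbm.helper.DefaultSerializer
        // Opt in to the extensible serializer singleton (name assumed):
        props.setProperty(SERIALIZER_CLASS, "jdbm.helper.ExtensibleSerializer$Singleton");
        System.out.println(props.getProperty(SERIALIZER_CLASS));
    }
}
```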
It would be nice to change this to use the configured default serializer for the record manager, which has the same semantics when the SERIALIZER_CLASS property has its default value, since that would make it transparent to change the default serialization for btrees by setting the SERIALIZER_CLASS property. However, that change would NOT be a binary-compatible change, since the [short] length would always be written (which is not the case today).

The state of the extensible serializer is persisted in the store. Its state consists entirely of data which would be amenable to a flat file representation, which opens up the possibility of offline configuration of the extensible serializer. The relevant data are:

  class name -> classId
  class name + versionId -> serializer class name

I ran into two problems when I tried to integrate the extensible serializer into jdbm. The fundamental problem is that the recman reference (and to a lesser extent the recid) is not available during (de-)serialization. This means that the state of the extensible serializer cannot be updated against the store, and also that explicit serializers (specified by recid) cannot be fetched from the store. I tried a variety of approaches to get around this problem and finally decided that I needed to change the Serializer API (see below). I am open to suggestions here. Kevin suggested a thin layer over BaseRecordManager, but that seems like it would conflate the serialization choices with subclassing of the record manager, which I think should be orthogonal choices.

I would like to propose that we address this problem (serializers must be stateless) by adding the recman reference (and the recid) to the Serializer API. Since serializers are sometimes invoked within other serializers, e.g., BPage uses key and value serializers, the change would have to be made manifest in all Serializers rather than just declaring a derived interface.
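As a sketch of why the persisted state is amenable to a flat file: the two mappings can be modeled as plain string-keyed maps. All class and field names below are illustrative:

```java
import java.util.*;

// Sketch of the extensible serializer's persisted state as two flat maps,
// suggesting how an offline flat-file representation could look.
// All names here are hypothetical.
public class SerializerState {

    // class name -> classId
    final Map<String, Integer> classIds = new HashMap<>();

    // class name + versionId -> serializer class name
    final Map<String, String> serializers = new HashMap<>();

    void register(String className, int classId) {
        classIds.put(className, classId);
    }

    void registerSerializer(String className, int versionId, String serializerClass) {
        serializers.put(className + "#" + versionId, serializerClass);
    }

    public static void main(String[] args) {
        SerializerState s = new SerializerState();
        s.register("com.example.Employee", 3);
        s.registerSerializer("com.example.Employee", 0, "com.example.EmployeeSerializer");
        System.out.println(s.classIds.get("com.example.Employee"));      // prints 3
        System.out.println(s.serializers.get("com.example.Employee#0"));
    }
}
```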
I think that adding the recid (in addition to the recman reference) to the Serializer API makes good sense. It makes it much easier to write serializers that touch up the transient state of a deserialized object to encapsulate the recid. During insert the recid is unknown, which is marked by a 0L value. I propagated the proposed change through the jdbm code easily enough. This will not affect binary compatibility, but it will touch the use of Serializers in BaseRecordManager and BPage. The impact of the change on existing code would be that existing Serializers outside of the jdbm code base would have to be modified to accept two additional parameters, as follows:

  public interface Serializer {

      public byte[] serialize( RecordManager recman, long recid, Object obj );

      public Object deserialize( RecordManager recman, long recid, byte[] serialized );

  }

In addition, code which uses the Serializable or Externalizable interface for persistence must be updated if it uses Serializers internally, i.e., if it is implementing some sort of compound record. The problem is that Java serialization is stateless (in the sense that it does not provide access to the recman reference or the recid). With the changed Serializer API, such code is now unable to invoke a Serializer. To support people who use Java serialization I wrote some trivial wrappers for java.io.ObjectInputStream and java.io.ObjectOutputStream that carry the necessary state (the recman reference and the recid). These classes are jdbm.helper.ObjectInputHelper and jdbm.helper.ObjectOutputHelper. The following classes were modified to use these new helper classes:

  - BPage
  - Serialization

I also made a "patch" to jdbm.helper.ObjectBAComparator since it does not have access to the recman or the recid for the serialized objects that it is comparing. There are no consumers of this class within the jdbm distribution, so I was not sure what to do in this case.
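Here is a minimal, self-contained sketch of a serializer under the proposed API, showing the recid touch-up on deserialization. RecordManager is stubbed, and the example class and its serializer are hypothetical; only the interface signatures come from the proposal above:

```java
import java.io.*;

// Stub standing in for jdbm's RecordManager, for illustration only.
interface RecordManager { }

// The proposed stateful Serializer API (throws added so the sketch compiles).
interface Serializer {
    byte[] serialize(RecordManager recman, long recid, Object obj) throws IOException;
    Object deserialize(RecordManager recman, long recid, byte[] serialized) throws IOException;
}

// Hypothetical persistent class with transient state derived from the store.
class Node {
    String name;
    transient long recid; // not persisted; restored from the store context
    Node(String name) { this.name = name; }
}

class NodeSerializer implements Serializer {
    public byte[] serialize(RecordManager recman, long recid, Object obj)
            throws IOException {
        return ((Node) obj).name.getBytes("UTF-8");
    }
    public Object deserialize(RecordManager recman, long recid, byte[] serialized)
            throws IOException {
        Node n = new Node(new String(serialized, "UTF-8"));
        n.recid = recid; // touch up transient state to encapsulate the recid
        return n;
    }
}

public class SerializerDemo {
    public static void main(String[] args) throws IOException {
        Serializer s = new NodeSerializer();
        // During insert the recid is unknown, marked by 0L:
        byte[] b = s.serialize(null, 0L, new Node("root"));
        Node n = (Node) s.deserialize(null, 99L, b);
        System.out.println(n.name + "@" + n.recid); // prints root@99
    }
}
```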
Without access to use cases for this class I can't make a definitive fix, but I can't imagine that it would be difficult to find a good solution.

Another issue that shows up within the context of the proposed Serializer API change is that it is difficult to layer RecordManager APIs. The situation arises when you want to invoke an operation in terms of the "outer" record manager semantics, but you only have access to the "inner" record manager. Consider:

  "outer" -> "inner"
  CacheRecordManager -> BaseRecordManager

If we want the extensible serializer to take advantage of the buffering offered by the caching layer, then we really want to be passing through the CacheRecordManager reference. This can be finessed since the cache layer is introduced by the Provider, but it would be nice to have a more general solution. Consider:

  "outer" -> "inner"
  MyRecordManager -> CacheRecordManager -> BaseRecordManager

Some systems that support intermediaries provide very complex request objects, which I think we can do without. I would like to suggest that we add setter and getter methods for an "outerRecordManager" property to address this issue. The property would be set when the record manager stack is configured. It could be used to make sure that operations take place in terms of the "outer" record manager when necessary. A getRecordManager() method could return the one-step delegate, and a getBaseRecordManager() would return the record manager that is the eventual delegate (the BaseRecordManager).

All in all this appears to require broader changes than I was expecting (or hoping). However, I trace these changes back to two issues: (1) Serializer is stateless; and (2) awareness of the record manager stack. All of the changes that I have made follow from those issues.

Cheers,

-bryan
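P.S. A minimal sketch of the outerRecordManager idea, with stubbed record manager classes. The wiring shown is one possible way the property could be set when the stack is configured; all names beyond those suggested above are illustrative:

```java
// Sketch of the proposed "outerRecordManager" property for layered record
// managers. The record manager classes are stubs for illustration.
interface RecordManager {
    void setOuterRecordManager(RecordManager outer);
    RecordManager getOuterRecordManager();
}

abstract class AbstractRecordManager implements RecordManager {
    private RecordManager outer = this; // default: no outer layer
    public void setOuterRecordManager(RecordManager outer) { this.outer = outer; }
    public RecordManager getOuterRecordManager() { return outer; }
}

class BaseRecordManager extends AbstractRecordManager { }

class CacheRecordManager extends AbstractRecordManager {
    private final RecordManager inner;
    CacheRecordManager(RecordManager inner) {
        this.inner = inner;
        inner.setOuterRecordManager(this); // wired when the stack is configured
    }
    RecordManager getRecordManager() { return inner; } // the one-step delegate
}

public class StackDemo {
    public static void main(String[] args) {
        BaseRecordManager base = new BaseRecordManager();
        CacheRecordManager cache = new CacheRecordManager(base);
        // The inner layer can now route operations through the outer layer:
        System.out.println(base.getOuterRecordManager() == cache); // prints true
        System.out.println(cache.getRecordManager() == base);      // prints true
    }
}
```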