From: Thompson, B. B. <BRY...@sa...> - 2005-10-03 20:50:43
Hello, I have been working on an extensible serialization scheme for jdbm. What follows is a description of the approach and of the changes required to integrate it into the jdbm code base. At this point the basic framework works ;-) all test suites pass (-; and I am seeking feedback on the approach before committing the changes or finalizing the approach.

The basic approach is that an [int] classId is used to identify which class was serialized, and the appropriate Serializer is looked up based on that classId. Two high bits are reserved in the classId. One of these bits indicates that a [byte] versionId follows; otherwise the versionId is zero. The other bit indicates that the recid of a Serializer follows, in which case that serializer is fetched and applied to deserialize the object. The mapping from classId to class name and the mapping (for each classId) from versionId to the serializer for that versionId are maintained in hash maps, which constitute the state of the extensible serializer.

You can explicitly register classes and their serializers against the extensible serializer. New classes are transparently registered during serialization and use default Java serialization if they implement Serializable or Externalizable; otherwise it is a serialization error. When new classes or serializers are registered, the state of the ExtensibleSerializer is updated against the store.

The overhead is an [int], plus a [byte] if you use the versionId, plus a [long] if you have an explicit serializerId, for a normal overhead of 4 bytes and a maximum overhead of 13 bytes. If you were willing to live with a limit of 2^14 (16,384) distinct classes, then you could bring those numbers down by 2 bytes by using a [short] to store the classId. I could also see moving to a [short] for the versionId.

Serializers are applied at two logical levels within jdbm. The first is the physical record, whose (de-)serialization is handled by the BaseRecordManager.
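To make the proposed header layout concrete, here is a minimal sketch of the encoding. The class, constant, and method names are illustrative only, not taken from the actual patch:

```java
import java.io.*;

// Sketch of the proposed record header: an [int] classId whose two high
// bits flag an optional [byte] versionId and an optional [long] serializer
// recid. All names here are hypothetical.
public class SerialHeader {

    static final int HAS_VERSION    = 1 << 31; // a [byte] versionId follows
    static final int HAS_SERIALIZER = 1 << 30; // a [long] serializer recid follows
    static final int CLASS_ID_MASK  = ~(HAS_VERSION | HAS_SERIALIZER);

    /** Extract the classId from a decoded header word. */
    public static int classId(int header) {
        return header & CLASS_ID_MASK;
    }

    /** Encode the header: 4 bytes normally, up to 4 + 1 + 8 = 13 bytes. */
    public static byte[] encode(int classId, byte versionId, long serializerRecid)
            throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        int header = classId & CLASS_ID_MASK;
        if (versionId != 0) header |= HAS_VERSION;
        if (serializerRecid != 0L) header |= HAS_SERIALIZER;
        out.writeInt(header);
        if (versionId != 0) out.writeByte(versionId);
        if (serializerRecid != 0L) out.writeLong(serializerRecid);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(encode(42, (byte) 0, 0L).length);  // prints 4
        System.out.println(encode(42, (byte) 3, 17L).length); // prints 13
    }
}
```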
When caching is used, the update operation is deferred and the serializer to be applied during update is stored in the cache entry along with the reference to the object and its recid. When the entry is evicted from the cache, the serializer noted on the cache entry is applied to generate the serialized form for the physical record.

The second is for "compound records", such as the BPage class. BPage uses the optional key and value serializers specified on the owning BTree, and default Java serialization when the key or value serializer was not specified (null). When the compound record is heterogeneous, the use of the extensible serializer makes good sense. However, if you have homogeneous compound records then it makes sense to use a custom serializer rather than the extensible serializer, since you can realize a significant space savings. For example, if you have a BTree and all keys are one datatype (e.g., Long) and all values are another datatype (e.g., String), then using custom key and value serializers would save 4 bytes per key and value compared to the extensible serializer (and the use of versionId and serializerId information could interfere with key compression). A mixed model with homogeneous keys and heterogeneous values is of course possible; heterogeneous values are persisted using Java serialization.

The proposed extensible serialization approach has a variety of advantages. If you do not opt in, it has no impact on the binary format of the store. If you opt in, then you can optimize the serialization of your classes by configuring the extensible serializer. You can also use optimized serialization with heterogeneous objects, e.g., a BTree whose keys are of a homogeneous type and whose value types are heterogeneous. Any value that has a registered custom serializer will be read and written using that serializer.
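As a sketch of the space savings from a custom serializer for homogeneous keys: a Long key costs 8 bytes under a custom serializer, versus 8 plus the 4-byte classId under the extensible serializer. The Serializer interface below mirrors jdbm's current stateless API; the demo class name is illustrative:

```java
import java.io.*;

// Minimal stateless serializer interface, mirroring jdbm's current API.
interface Serializer {
    byte[] serialize(Object obj) throws IOException;
    Object deserialize(byte[] data) throws IOException;
}

// Custom key serializer for a BTree whose keys are all Long:
// exactly 8 bytes per key, no per-key classId overhead.
class LongSerializer implements Serializer {
    public byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream(8);
        new DataOutputStream(bos).writeLong((Long) obj);
        return bos.toByteArray();
    }
    public Object deserialize(byte[] data) throws IOException {
        return new DataInputStream(new ByteArrayInputStream(data)).readLong();
    }
}

public class LongSerializerDemo {
    public static void main(String[] args) throws IOException {
        Serializer s = new LongSerializer();
        byte[] b = s.serialize(12345L);
        System.out.println(b.length);         // prints 8
        System.out.println(s.deserialize(b)); // prints 12345
    }
}
```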
If you use the extensible serializer and do not specify a custom serializer for some value class, then it will use default serialization for now, but you can always add a versioned serializer for that class later and transparently migrate to a more efficient serialization technique on a class-by-class basis. By default all classes are unversioned, but you can modify the Class implementation at any time to specify a versionId (it is identified by reflection and then cached), at which point existing instances will be read using the serializer for the unversioned class and written using the serializer for the versionId currently marked on the class. New versions can be introduced over time, and old versions can always be read (as long as the appropriate Serializer is still available).

Using a new RecordManagerOption (SERIALIZER_CLASS) you can opt in to the extensible serializer by naming a singleton that proxies for the extensible serializer (ExtensibleSerializer.Singleton) as the value of the SERIALIZER_CLASS property. Code which specifies an explicit Serializer in recman API calls will continue to use that Serializer (so the RecordManager API semantics do not change), but code which uses the implicit serializer, e.g., insert( obj ), fetch( recid ), and update( recid, obj ), will use the configured default serializer. For backward compatibility the default value for this property is "jdbm.helper.DefaultSerializer". A new method on the RecordManager interface, getDefaultSerializer(), returns the configured serializer instance.

BTree has optional key and value serializers. When [null], these currently result in the use of Java serialization (writeObject is called on the key or value). When a serializer is specified, the key (or value) is serialized and then a [short] length is written followed by the serialized byte[].
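A minimal sketch of how opting in might look from the caller's side. The property key string and the singleton class name used here are assumptions for illustration; only the default value "jdbm.helper.DefaultSerializer" comes from the proposal:

```java
import java.util.Properties;

// Hypothetical opt-in sketch for the SERIALIZER_CLASS RecordManagerOption.
// The key string "jdbm.serializer.class" and the singleton class name are
// assumed for illustration and may differ in the final patch.
public class OptInDemo {
    static final String SERIALIZER_CLASS = "jdbm.serializer.class"; // assumed key

    public static void main(String[] args) {
        Properties props = new Properties();
        // Default: the backward-compatible Java serialization wrapper.
        String def = props.getProperty(SERIALIZER_CLASS, "jdbm.helper.DefaultSerializer");
        System.out.println(def); // prints jdbm.helper.DefaultSerializer
        // Opt in to the extensible serializer singleton (name assumed):
        props.setProperty(SERIALIZER_CLASS, "jdbm.helper.ExtensibleSerializer$Singleton");
        System.out.println(props.getProperty(SERIALIZER_CLASS));
    }
}
```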
It would be nice to change this to use the configured default serializer for the record manager, which has the same semantics when the SERIALIZER_CLASS property has its default value, since that would make it transparent to change the default serialization for btrees by setting the SERIALIZER_CLASS property. However, that change would NOT be a binary-compatible change, since the [short] length would always be written (which is not the case today).

The state of the extensible serializer is persisted in the store. Its state consists entirely of data which would be amenable to a flat file representation, which opens up the possibility of offline configuration of the extensible serializer. The relevant data are:

  class name -> classId
  class name + versionId -> serializer class name

I ran into two problems when I tried to integrate the extensible serializer into jdbm. The fundamental problem is that the recman reference (and to a lesser extent the recid) is not available during (de-)serialization. This means that the state of the extensible serializer cannot be updated against the store, and also that explicit serializers (specified by recid) cannot be fetched from the store. I tried a variety of approaches to get around this problem and finally decided that I needed to change the Serializer API (see below). I am open to suggestions here. Kevin suggested a thin layer over BaseRecordManager, but that seems like it would conflate the serialization choices with subclassing of the record manager, which I think should be orthogonal choices.

I would like to propose that we address this problem (serializers must be stateless) by adding the recman reference (and the recid) to the Serializer API. Since serializers are sometimes invoked within other serializers, e.g., BPage uses key and value serializers, the change would have to be made manifest in all Serializers rather than just declaring a derived interface.
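As a sketch of why the persisted state is amenable to a flat file: the two mappings can be modeled as plain string-keyed maps. All class and field names below are illustrative:

```java
import java.util.*;

// Sketch of the extensible serializer's persisted state as two flat maps,
// suggesting how an offline flat-file representation could look.
// All names here are hypothetical.
public class SerializerState {

    // class name -> classId
    final Map<String, Integer> classIds = new HashMap<>();

    // class name + versionId -> serializer class name
    final Map<String, String> serializers = new HashMap<>();

    void register(String className, int classId) {
        classIds.put(className, classId);
    }

    void registerSerializer(String className, int versionId, String serializerClass) {
        serializers.put(className + "#" + versionId, serializerClass);
    }

    public static void main(String[] args) {
        SerializerState s = new SerializerState();
        s.register("com.example.Employee", 3);
        s.registerSerializer("com.example.Employee", 0, "com.example.EmployeeSerializer");
        System.out.println(s.classIds.get("com.example.Employee"));      // prints 3
        System.out.println(s.serializers.get("com.example.Employee#0"));
    }
}
```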
I think that adding the recid (in addition to the recman reference) to the Serializer API makes good sense. It makes it much easier to write serializers that touch up the transient state of a deserialized object to encapsulate the recid. During insert the recid is unknown, which is marked by a 0L value. I propagated the proposed change through the jdbm code easily enough. This will not affect binary compatibility, but it will touch the use of Serializers in BaseRecordManager and BPage. The impact of the change on existing code would be that existing Serializers outside of the jdbm code base would have to be modified to accept two additional parameters, as follows:

  public interface Serializer {

      public byte[] serialize( RecordManager recman, long recid, Object obj );

      public Object deserialize( RecordManager recman, long recid, byte[] serialized );

  }

In addition, code which uses the Serializable or Externalizable interface for persistence must be updated if it uses Serializers internally, i.e., if it is implementing some sort of compound record. The problem is that Java serialization is stateless (in the sense that it does not provide access to the recman reference or the recid). With the changed Serializer API, such code is now unable to invoke a Serializer. To support people who use Java serialization I wrote some trivial wrappers for java.io.ObjectInputStream and java.io.ObjectOutputStream that carry the necessary state (the recman reference and the recid). These classes are jdbm.helper.ObjectInputHelper and jdbm.helper.ObjectOutputHelper. The following classes were modified to use these new helper classes:

  - BPage
  - Serialization

I also made a "patch" to jdbm.helper.ObjectBAComparator since it does not have access to the recman or the recid for the serialized objects that it is comparing. There are no consumers of this class within the jdbm distribution, so I was not sure what to do in this case.
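Here is a minimal, self-contained sketch of a serializer under the proposed API, showing the recid touch-up on deserialization. RecordManager is stubbed, and the example class and its serializer are hypothetical; only the interface signatures come from the proposal above:

```java
import java.io.*;

// Stub standing in for jdbm's RecordManager, for illustration only.
interface RecordManager { }

// The proposed stateful Serializer API (throws added so the sketch compiles).
interface Serializer {
    byte[] serialize(RecordManager recman, long recid, Object obj) throws IOException;
    Object deserialize(RecordManager recman, long recid, byte[] serialized) throws IOException;
}

// Hypothetical persistent class with transient state derived from the store.
class Node {
    String name;
    transient long recid; // not persisted; restored from the store context
    Node(String name) { this.name = name; }
}

class NodeSerializer implements Serializer {
    public byte[] serialize(RecordManager recman, long recid, Object obj)
            throws IOException {
        return ((Node) obj).name.getBytes("UTF-8");
    }
    public Object deserialize(RecordManager recman, long recid, byte[] serialized)
            throws IOException {
        Node n = new Node(new String(serialized, "UTF-8"));
        n.recid = recid; // touch up transient state to encapsulate the recid
        return n;
    }
}

public class SerializerDemo {
    public static void main(String[] args) throws IOException {
        Serializer s = new NodeSerializer();
        // During insert the recid is unknown, marked by 0L:
        byte[] b = s.serialize(null, 0L, new Node("root"));
        Node n = (Node) s.deserialize(null, 99L, b);
        System.out.println(n.name + "@" + n.recid); // prints root@99
    }
}
```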
Without access to use cases for this class I can't make a definitive fix, but I can't imagine that it would be difficult to find a good solution.

Another issue that shows up within the context of the proposed Serializer API change is that it is difficult to layer RecordManager APIs. The situation arises when you want to invoke an operation in terms of the "outer" record manager semantics, but you only have access to the "inner" record manager. Consider:

  "outer" -> "inner"
  CacheRecordManager -> BaseRecordManager

If we want the extensible serializer to take advantage of the buffering offered by the caching layer, then we really want to be passing through the CacheRecordManager reference. This can be finessed since the cache layer is introduced by the Provider, but it would be nice to have a more general solution. Consider:

  "outer" -> "inner"
  MyRecordManager -> CacheRecordManager -> BaseRecordManager

Some systems that support intermediaries provide very complex request objects, which I think we can do without. I would like to suggest that we add setter and getter methods for an "outerRecordManager" property to address this issue. The property would be set when the record manager stack is configured. It could be used to make sure that operations take place in terms of the "outer" record manager when necessary. A getRecordManager() method could return the one-step delegate, and a getBaseRecordManager() would return the record manager that is the eventual delegate (the BaseRecordManager).

All in all this appears to require broader changes than I was expecting (or hoping). However, I trace these changes back to two issues: (1) Serializer is stateless; and (2) awareness of the record manager stack. All of the changes that I have made follow from those issues.

Cheers,

-bryan
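P.S. A minimal sketch of the outerRecordManager idea, with stubbed record manager classes. The wiring shown is one possible way the property could be set when the stack is configured; all names beyond those suggested above are illustrative:

```java
// Sketch of the proposed "outerRecordManager" property for layered record
// managers. The record manager classes are stubs for illustration.
interface RecordManager {
    void setOuterRecordManager(RecordManager outer);
    RecordManager getOuterRecordManager();
}

abstract class AbstractRecordManager implements RecordManager {
    private RecordManager outer = this; // default: no outer layer
    public void setOuterRecordManager(RecordManager outer) { this.outer = outer; }
    public RecordManager getOuterRecordManager() { return outer; }
}

class BaseRecordManager extends AbstractRecordManager { }

class CacheRecordManager extends AbstractRecordManager {
    private final RecordManager inner;
    CacheRecordManager(RecordManager inner) {
        this.inner = inner;
        inner.setOuterRecordManager(this); // wired when the stack is configured
    }
    RecordManager getRecordManager() { return inner; } // the one-step delegate
}

public class StackDemo {
    public static void main(String[] args) {
        BaseRecordManager base = new BaseRecordManager();
        CacheRecordManager cache = new CacheRecordManager(base);
        // The inner layer can now route operations through the outer layer:
        System.out.println(base.getOuterRecordManager() == cache); // prints true
        System.out.println(cache.getRecordManager() == base);      // prints true
    }
}
```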