Work at SourceForge, help us to make it a better place! We have an immediate need for a Support Technician in our San Francisco or Denver office.

Close

String serialization

Kevin Day
2005-08-19
2013-06-03
  • Kevin Day
    Kevin Day
    2005-08-19

    Does anyone have any comments about the best way to store a string in a byte array?

    Using a special byte combo as a terminator allows for strings of any length (and the resultant byte array is compatible with leading value compression).  However, I'm not sure of the implications of relying on a string terminator in the Unicode character set.  Is it safe to just encode an end of string with a 0 value byte?  Or do I need to use {0, 0}, or something else?

    The other alternative is to write the length of the string before the bytes - but that limits the maximum length of the string, is an inneficient use of space, and pretty much kills leading value compression.

    If only the string is being stored, then we can just use getBytes() and let jdbm handle the length for us - but that's not really an option when using custom serialization to store a composite value that includes a string...

    Any suggestions?

    Thanks,

    - K

     
    • Alex Boisvert
      Alex Boisvert
      2005-08-19

      Java follows a "modified UTF-8" encoding which ensures that no null bytes are ever found within a string.

      See http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8 for details.

      Based on this, if the serializer uses this modified UTF-8 encoding, we should be OK with a null-terminator approach.

      alex

       
      • Kevin Day
        Kevin Day
        2005-08-20

        Thanks for the link - I've been tearing what's left of my hair out looking for that.

        - K

         
        • Bryan Thompson
          Bryan Thompson
          2005-08-22

          Kevin,

          Will a general purpose "leading bytes" compression strategy work for unicode data?  What happens if it decides to break a character in half?  Perhaps we need to make the key compression algorithm part of the BTree initialization instead?  We could create default compression algorithms for common key types (long, long[3] (which you use), String, etc.).

          -bryan

           
          • Kevin Day
            Kevin Day
            2005-08-22

            Bryan-

            Yes - this supports unicode with no problems.  Remember, we are working on the byte array.  As long as we restore the original byte array properly before we send it to the deserialize() method, it will work.

            The architecture I've implemented allows for creation of specialized compression algorithms if it is desired.  Leading value compression is just a demonstration that it works (it also happens to be a pretty good approach...).

            Other compression strategies that could be implemented would be:

            Kent's original approach (i.e. store the number of common bytes across all of the keys, instead of relative to the prior key) - in this instance, the compressor would cache all of the byte arrays until compressor.finalize() is called, then it would compute the largest common block that works for all keys.

            zlib compression (this would be insanely slow, and probably not worth it)

            custom type compression, such as long key compression, etc...  For simple primitives, the leading key compression algorithm will work pretty nicely (as long as they are stored in Big Endian format, which jdbm does).

            In addition to the key compression strategy, I have also implemented a value compression algorithm into my MultiPartLongValue serializer.  It actually gives up a few bits worth of range (i.e. the two leftmost bits), to store only the parts of the value that are non-zero bits - i.e. if you only need 4 bytes to store a long value, then it only uses 4 bytes (instead of the full 8 bytes).  This, however, is implemented in the serializer and deserializer itself, instead of an external key compressor (this makes sense because it also helps to compress the values stored in the BTree and recman - not just keys).

            Anyway, the architecture I've proposed will support what you are talking about...

            - K

             
    • Kevin Day
      Kevin Day
      2005-08-21

      Alex-

      Just tried this out, and "no dice"...

      writeUTF() writes a Short to the beginning of the byte array containing the length of the string...

      In order for key compression to work, we need to have a null terminator end the string and have no junk at the beginning...

      I'm going to think on this one.  Maybe the correct approach is to instruct the compressor to ignore a specified number of leading bytes in it's algorithm...

      - K

       
    • Kevin Day
      Kevin Day
      2005-08-22

      I don't think we can use the CLASSPATH code without changing jdbm to GPL (instead of LGPL).  I'm not at all interested in doing that...

      I can rewrite this using the spec from the java docs if necessary...

      I guess for simple string storage that using the String.getBytes() method would probably work fine.

      Thinking about it, coming up with a strategy that is end terminated is really only needed if the string is used as part of a multi-part key...

      - K

       
      • Alex Boisvert
        Alex Boisvert
        2005-08-22

        Yep, I suggest we rewrite the encoding code from spec.

        alex

         
        • Kevin Day
          Kevin Day
          2005-08-22

          Alex-

          Can you forsee any issue with just using getBytes() for simple string keys?

          Platform encoding will be used, but unless a jdbm database is moved from one system to another (with a different encoding), it shouldn't be an issue...  I'm just not sure how often that kind of thing happens.

          - K

           
          • Alex Boisvert
            Alex Boisvert
            2005-08-22

            Agreed, getBytes() is sufficient.

            I think by default JDBM should store Strings in UTF-8 to avoid database portability problems.

            alex