Serialization Of LZMA2 Running Compressor

John Herry
2010-11-30
2013-05-30
  • John Herry
    2010-11-30

    Hello, everybody:
          Is there any way to serialize an LZMA2 encoder? For example, we are compressing text with the `LZMA_SYNC_FLUSH` action, and at some point we want to save the current compressor (all its data structures and its dictionary) to disk as a file or file-like object. We would then send this `serialized compressor` to another machine (or keep it on the same machine) and rebuild a compressor there with the same data structures and state, so that it can compress new text just like the original encoder.
         I suspect the current LZMA2 implementation does not support this. If we wanted to implement it, which data and state in memory would have to be kept in the serialized version so that the compressor can be rebuilt directly? Are there any detailed suggestions?
         Also, we had assumed that the compression dictionary can be reused to compress data with related content, without caring how the dictionary evolves. If the current dictionary was built from earlier uncompressed data, and the next uncompressed content shares parts with that earlier data, then we should get a good compression ratio when compressing the next data with the current dictionary.

    Best Regards!

     
  • Lasse Collin
    2010-12-02

    liblzma won't support saving the state in the foreseeable future. The internal data structures and algorithm details may vary between versions so saving the state would be unlikely to work between different versions of the library.

    If you don't need the exact state and it is enough to continue compression on another machine, the preset dictionary feature might help. Concatenation would work only with raw LZMA2 streams though (not .xz streams). Even that would carry a slight penalty in compression ratio, because it would reset the probability arrays.

    If all you want is to compress small independent chunks of similar data on different systems, the preset dictionary feature may be what you are looking for. Unfortunately it isn't supported in the .xz container yet, but I think it will be when XZ Utils 5.2.0 is out (no idea when, not very soon anyway). So right now the preset dictionary is limited to raw LZMA2 streams.

    To make it faster to use the same preset dictionary to compress multiple files, liblzma needs a function to duplicate the encoder state (similar to deflateCopy() in zlib). I will add such a function in the future.

     
  • John Herry
    2010-12-03

    Hi, Larhzu
        Thanks for your help. But there are still three points that I am not quite clear about.
    1. How do I save the LZMA2 dictionary (to use as a "preset dictionary" afterwards) during compression? I did not find an interface that gives direct access to the dictionary.
    2. As you explained in another topic, the LZMA2 dictionary always "holds the most recently compressed or decompressed uncompressed bytes". So if I use a "preset dictionary", will this dictionary stay fixed or be continually updated during compression?
    3. How do I decompress such "raw lzma data"? Do I need to give the "preset dictionary" to the decompressor? How do I initialize the decompressor with it?

    Best Regards!

     
  • Lasse Collin
    2010-12-03

    Unless you are using multiple filters, the dictionary is nothing more than plain uncompressed data. You don't need any interface to get the dictionary. Just put typical data into a buffer that will be used as a preset dictionary. Put the most likely data at the end of the buffer. See src/liblzma/api/lzma/lzma.h or /usr/include/lzma/lzma.h. You need the same preset dictionary when decompressing.
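
    To make the buffer layout concrete, here is a minimal sketch in C of one way to assemble such a buffer. The helper name build_preset_dict and the convention of passing samples ordered from least likely to most likely are assumptions made for illustration; nothing here is part of liblzma. It relies only on the advice above that the most likely data belongs at the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not a liblzma function): concatenate the sample
 * strings into `out` in the order given. The caller lists the least
 * likely sample first and the most likely sample last. If the samples
 * do not fit into `cap` bytes, the leading (least likely) bytes are
 * dropped so that the tail of the buffer survives.
 * Returns the number of bytes stored in `out`.
 */
size_t build_preset_dict(uint8_t *out, size_t cap,
                         const char *const *samples, size_t count)
{
    assert(out != NULL && cap > 0);
    size_t used = 0;

    for (size_t i = 0; i < count; ++i) {
        const uint8_t *src = (const uint8_t *)samples[i];
        size_t len = strlen(samples[i]);

        if (len >= cap) {
            /* This sample alone fills the buffer: keep its last cap bytes. */
            memcpy(out, src + (len - cap), cap);
            used = cap;
            continue;
        }
        if (used + len > cap) {
            /* Slide the existing content left to free room at the end. */
            size_t drop = used + len - cap;
            memmove(out, out + drop, used - drop);
            used -= drop;
        }
        memcpy(out + used, src, len);
        used += len;
    }
    return used;
}
```

    For instance, with an 8-byte buffer and the samples "abcd", "efgh", "ij" (in that order), the buffer ends up holding "cdefghij": the two oldest bytes are pushed out and the most likely sample sits at the very end. The result can then be assigned to opt.preset_dict and opt.preset_dict_size.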

    The dictionary is updated the same way even if you use a preset dictionary. So the dictionary is updated all the time.

    Encoder initialization:

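    #include <lzma.h>

    lzma_stream strm = LZMA_STREAM_INIT;
    lzma_options_lzma opt;
    lzma_lzma_preset(&opt, LZMA_PRESET_DEFAULT);  /* returns true on error */
    opt.preset_dict = buf;            /* the same buffer is needed when decoding */
    opt.preset_dict_size = buf_size;
    lzma_filter filters[] = {         /* array terminated by LZMA_VLI_UNKNOWN */
        { LZMA_FILTER_LZMA2, &opt },
        { LZMA_VLI_UNKNOWN, NULL }
    };
    lzma_raw_encoder(&strm, filters);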

    The decoder goes the same way except with lzma_raw_decoder().

     
  • John Herry
    2010-12-04

    Oh, the dictionary is really not what I had imagined. I had supposed that it would be a set of the most frequently used keywords in a `Hashable` data structure in memory. Now it is clear to me. Thank you very much, Larhzu.

    Best Regards!

     
  • Lasse Collin
    2010-12-09

    There is a hash table in the match finder in the encoder. The dictionary is one part of the match finder's data structures.
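
    As a rough illustration of how a hash table and a dictionary of plain bytes fit together, here is a toy match finder in C. It is not liblzma's match finder and does not reflect its internals; every name in it (toy_match_finder, toy_mf_insert, toy_mf_find) is made up for this sketch. To stay short it indexes the table directly by the next two bytes, so there are no collisions to handle, and it remembers only the most recent position for each byte pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MIN_MATCH 2
#define TOY_TABLE_SIZE 65536u        /* one slot per possible byte pair */
#define TOY_NO_POS UINT32_MAX

typedef struct {
    const uint8_t *window;           /* the dictionary: plain uncompressed bytes */
    size_t size;                     /* number of valid bytes in window */
    uint32_t head[TOY_TABLE_SIZE];   /* byte pair -> most recent position */
} toy_match_finder;

/* Table index for the two bytes starting at p. */
static uint32_t toy_key(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8);
}

void toy_mf_init(toy_match_finder *mf, const uint8_t *window, size_t size)
{
    assert(mf != NULL && window != NULL);
    mf->window = window;
    mf->size = size;
    for (size_t i = 0; i < TOY_TABLE_SIZE; ++i)
        mf->head[i] = TOY_NO_POS;
}

/* Remember that the byte pair at `pos` was seen at `pos`. */
void toy_mf_insert(toy_match_finder *mf, size_t pos)
{
    if (pos + TOY_MIN_MATCH <= mf->size)
        mf->head[toy_key(mf->window + pos)] = (uint32_t)pos;
}

/*
 * Look for an earlier occurrence of the bytes at `pos`.
 * Returns the match length (0 if none) and stores how far back the
 * match starts in *distance.
 */
size_t toy_mf_find(const toy_match_finder *mf, size_t pos, size_t *distance)
{
    if (pos + TOY_MIN_MATCH > mf->size)
        return 0;

    uint32_t cand = mf->head[toy_key(mf->window + pos)];
    if (cand == TOY_NO_POS || cand >= pos)
        return 0;

    /* The hash table only proposes a candidate; the dictionary bytes
     * themselves decide how long the match really is. */
    size_t len = 0;
    while (pos + len < mf->size
            && mf->window[cand + len] == mf->window[pos + len])
        ++len;

    *distance = pos - cand;
    return len;
}
```

    With the 24-byte window "hello world, hello there", inserting positions 0 through 12 and then searching at position 13 reports a match of length 6 at distance 13 (the repeated "hello "). A preset dictionary plays the role of bytes that are already in the window before the first real input arrives.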

     
  • John Herry
    2010-12-10

    Thanks, Larhzu.
         Is there any detailed documentation that would let me dig deeper into the LZMA2 compressor? So far I can only get some information from the lzma header files in the `include` directory, and it is somewhat hard for me to understand the algorithm from the source code alone. I think some documents on the theory behind LZMA2 would be a great help in following the code.

    Best Regards!

     
  • Lasse Collin
    2010-12-13

    As the XZ Utils FAQ says, there are no docs about LZMA at the moment. I have planned to write some docs about it, but there are many other things I should do too, and documenting LZMA isn't such a high priority at the moment.