Experimenting with a preset dictionary,...

  • Michael Brian Bentley

    I would like to build liblzma for iOS. Is there a switch I can add to the configure step to persuade the process to build a static library for use on ARM from an OS X machine?

    I can build liblzma two ways for OS X, the first using xz's stock configure/make/make install process to build a static lib; the other a hand-built Xcode-based build. When I use the built library in an Xcode project, I see the same problem that I see from my Xcode-based build: calling lzma_raw_buffer_encode returns LZMA_OK, but the return data length is larger than the input data, and the compressed data consists primarily of 0x00.

    The command-line tools lzma and unlzma work just fine; I'm apparently not sure how to build the static library such that it works programmatically in an Xcode project.

    I've experimented with the call with and without a preset dictionary. When I tell the configure/make/make install process to use llvm clang instead of gcc, the runtime results are the same. So all the programmatic results are consistent but not quite correct. When I build the library in Xcode, I tweak a copy of config.h generated by the normal build process.

    Are there some things I should be seeing when I'm stepping through this code?

  • Michael Brian Bentley

    So I noticed while tracing the code that it thought it compressed the test data from 159 bytes to 38 bytes, and then figured out that a bad initial value for one of the parameters to lzma_raw_buffer_encode had to be responsible for what I was seeing: compressed data 2048+38+header bytes long, mostly filled with 0x00. Changing the initial value from 2048 to 0 fixed it.


  • Michael Brian Bentley

    Is there an ideal form for a preset dictionary that I'm not building? I have constructed a test dictionary 320K in size, made up of phrases that occur in the test data, in reverse order of frequency. When I compress the most commonly encountered phrase in the dictionary, the original is 109 characters long while the compressed version is 62. That is far less compression than I would expect from such a cherry-picked test case.

  • Lasse Collin

    Lasse Collin - 2012-11-25

    The rule of thumb is to put the most common strings at the end of the preset dictionary. I understood that you have done exactly this.

    62 bytes doesn't sound right. It should be less than 20 bytes. I don't have much of a clue what's wrong.

    If the dictionary size is smaller than the preset dictionary, the whole preset dictionary cannot be used. In that case the beginning of the preset dictionary is thrown away. However, even if you had set the dictionary size smaller than the preset dictionary, you should still have gotten good compression because your test string was from the end of the preset dictionary (assuming that the most common string was at the end of the preset dictionary).
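
The truncation rule described above can be modelled with a little arithmetic. This is a simplified sketch of the rule as stated in the post, not liblzma's internal buffer accounting (the real match finder buffer may hold somewhat more than `dict_size` bytes). Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: if only the last dict_size bytes of the preset
 * dictionary are kept, return the offset of the first byte that survives. */
size_t preset_dict_kept_offset(size_t preset_size, size_t dict_size)
{
    return preset_size > dict_size ? preset_size - dict_size : 0;
}

/* Returns 1 if a phrase stored at [phrase_off, phrase_off + phrase_len)
 * in the preset dictionary lies entirely in the kept tail, else 0. */
int phrase_survives(size_t preset_size, size_t dict_size,
                    size_t phrase_off, size_t phrase_len)
{
    /* The phrase must lie inside the preset dictionary. */
    assert(phrase_len <= preset_size);
    assert(phrase_off <= preset_size - phrase_len);

    return phrase_off >= preset_dict_kept_offset(preset_size, dict_size);
}
```

For example, with the 320K preset dictionary from this thread and an assumed 64K dictionary size, the first 256K would be dropped under this model, while a 109-byte phrase stored at the very end would still be available for matching.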

    I need more information (code and test data) to help you more.

  • Michael Brian Bentley

    If the dictionary is smaller than the preset dictionary, then to make it fit it tosses the beginning (the good stuff?) of the preset dictionary. I need to make sure one gets its size from the other. I may not have changed the dictionary size to accommodate the preset dictionary size.

  • Michael Brian Bentley

    Making the dictionary the same size as the preset dictionary seems to improve matters. An 1846 byte message compressed down to 296 bytes using a 320K preset dictionary. This feels like a productive direction, thanks!

  • Lasse Collin

    Lasse Collin - 2012-11-26

    The best stuff (the most probable strings) should be at the end of the preset dictionary. Thus, throwing away the beginning of the preset dictionary should do the least amount of harm. It is also consistent with how compression with LZ77-based algorithms works: they keep the most recent dict_size bytes in memory.

    For the best results, try setting the dictionary size to at least preset_dict_size + the size of the data being compressed. That way the whole preset dictionary stays in the dictionary buffer the whole time.

