How to generate preset dictionaries?

2014-06-26
2014-07-07
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2014-06-26

    Hi, I'm interested in trying out the ability to use preset dictionaries, but from the description in the API docs, I'm not sure how to go about it.

    What format does it need to be in? How do I generate them?

    Do I need to compress a few files then dump the window from the LZ coder or similar?

    Or is it treated like part of the data to be compressed, that is, the dictionary is (essentially) prefixed onto the actual data for the purposes of matching, but not sent to the output?

    If the latter, are there any tools which can build a suitable dictionary based on a sample of typical data?

    Thanks.

     
  • Lasse Collin

    Lasse Collin - 2014-06-29

    Or is it treated like part of the data to be compressed, that is, the dictionary is (essentially) prefixed onto the actual data for the purposes of matching, but not sent to the output?

    Yes, that's exactly how it works.

    If the latter, are there any tools which can build a suitable dictionary based on a sample of typical data?

    I'm not aware of any but I haven't really searched either. Basically a preset dictionary should contain strings that are common in the data being compressed.

    Preset dictionary is probably most useful when compressing many similar small buffers separately. liblzma doesn't have a function to clone the encoder state (lzma_stream) which means that the preset dictionary needs to reprocessed for every buffer. This can waste quite a bit of time and thus can be a performance problem. zlib has deflateCopy() but liblzma doesn't have such functionality yet. It's planned but it won't be in 5.2.0.

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2014-07-01

    Thanks for the response Lasse - that makes a lot more sense to me.

    Since zlib also has a preset dictionary, I tried searching if anyone's tried to do something similar but I haven't found much. I'll keep looking...


    Just out of interest, here's another somewhat related question - from a theoretical standpoint...
    Would it make sense to be able to store some sort of preset entropy coder state? Similar to a static Huffman tree I'd presume? (I'm not that knowledgeable on entropy coding, so please excuse me on that) Of course, this would be for short similar bits of data.

    Edit: or does/could the preset dictionary be used to prime the entropy coder?

    For example, if we had a database of URLs that need to be stored separately. There's a fair bit of redundancy amongst the URLs - beyond the common substrings (eg "http://") we'd have a high distribution of lowercase a-z characters and lower distribution of characters like '?' or \0. It seems like if the entropy coder could be primed like the dictionary coder, it could be useful here.

     
    Last edit: ZiNgA BuRgA 2014-07-02
  • Lasse Collin

    Lasse Collin - 2014-07-06

    It makes sense in theory but I don't have a clue how much it would help in practice (I have never tested). The preset dictionary doesn't affect anything else than the match finder. It could be made to affect the range coder too, but at least in the simplest implementation it would require fetching the primed state from the encoder after initialization and storing it for decompression. Different encoder versions and compression options would give different state so one could be stuck with a particular encoder version.

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2014-07-07

    Ah, good point - I didn't think about software changes. I do wonder whether anyone's thought about it for zlib - considering its age you could probably stick to one version (or just assume that it doesn't really change much) and I'd imagine the advantages of LZMA become smaller with small amounts of data anyway.

    Thanks again for the response Lasse :)

     
    Last edit: ZiNgA BuRgA 2014-07-07

Log in to post a comment.

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks