Preset dictionary

2011-06-28 - 2013-05-30
  • Jonathan Morton - 2011-06-28

    I'm considering writing a game-asset-package library to help out the developers of an existing game.  Currently the asset files are just sprayed around the filesystem, are uncompressed, and have a great deal of inter-file redundancy (many of the assets are similar but not identical to each other).  Loading a single object typically requires reading half a dozen files, which are invariably scattered randomly across the disk (at least under Windows), and it is not unusual for entering the world to require reading a gigabyte off the disk.  Most of the data is very compressible - the entire asset directory compresses from over 20GB to under 4GB when using 7-Zip's solid mode.

    Because these assets must be loaded incrementally while proceeding through the game world, I want to avoid the latency and processing overhead of solid archives.  Nevertheless, there is a lot of potential file-size saving by recognising some inter-file redundancy.  I've run some tests which show that a solid (.tar.lzma) archive can in some cases be a third or less of the size of a non-solid .7z archive.  This is particularly associated with geometry data, since there are frequently several variants of a vehicle which have a lot of geometry in common.

    I had the idea that I could provide one ready-decoded file from earlier in the archive to the decompressor, as context for decoding another file later in the archive.  The sliding window would then advance as usual, but the dictionary would include the other file as a prefix to the current file.  However, it seems that the LZMA SDK currently doesn't support this feature.

    Would this be easy to add, or would it be easier for me to write my own compressor based on an extended version of Deflate?  (I'm certain that I *can* do the latter.)

     
  • Igor Pavlov - 2011-06-28

    You can set the solid block size in 7-Zip for a .7z archive.
    It reduces the time of random file decompression.
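
    For example, the -ms switch sets the solid block size when creating an archive (names here are placeholders; see the 7-Zip command-line documentation for the full switch syntax):

    7z a -ms=16m archive.7z assets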

     
  • Igor Pavlov - 2011-06-28

    If you want to use the .xz format with XZ Utils, or xz from the LZMA SDK (7-Zip), you can try the multi-block mode. But you must write some additional code in that case.

    If you want to use 7-Zip code, it's better to write to the 7-Zip forum.

     
  • Jonathan Morton - 2011-06-28

    That's all very well, but as I already mentioned, 7-Zip's solid mode doesn't meet my needs.  If I set the block size low enough to get a worst-case performance improvement, I lose half the size benefits because the context has to be rebuilt from scratch in each new block.  At the same time, I don't want to have the decompressor wade through a dozen Mk1 bodyshells in order to get to the one at the back of the file.

    All I want is to provide a memory buffer containing the reference file at the start of decompression.  Is that difficult to implement?

     
  • Lasse Collin - 2011-06-28

    liblzma from XZ Utils supports preset dictionary with LZMA2. It currently isn't supported in the .xz format, but maybe it will be some day in the future. If you want to use a preset dictionary with LZMA2 right now, you need to use raw LZMA2 streams.

    If you end up going with Deflate, you can use zlib and its zlib format, which supports a preset dictionary. So you don't need a custom compressor for that, although you do need some sort of archive format to pack all the compressed files into a single file. A somewhat lazy option would be to use LZMA2 or zlib with a preset dictionary and put the compressed files into a single .zip or .7z file without further compression.
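
    A minimal zlib sketch of the preset dictionary usage (buffer names are placeholders):

    z_stream zs = {0}; // from zlib.h; zero-init gives Z_NULL zalloc/zfree
    if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK) { handle_error(); }
    // The dictionary must be set immediately after deflateInit().
    if (deflateSetDictionary(&zs, dict_buf, dict_len) != Z_OK) { handle_error(); }

    zs.next_in = in;
    zs.avail_in = in_size;
    zs.next_out = out;
    zs.avail_out = out_size;
    if (deflate(&zs, Z_FINISH) != Z_STREAM_END) { handle_error(); }
    deflateEnd(&zs);

    // When decoding, inflate() stops with Z_NEED_DICT; call
    // inflateSetDictionary() with the same dictionary and resume.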

    I know one person who needs functionality similar to yours. It's for a different type of data, but the idea is the same, including the use of a preset dictionary or dictionaries and maybe solid compression with small blocks (only a few files per block). It might be that he won't even need filenames; file IDs (integers) could be enough. With IDs the program would need fewer seeks to find the correct file, and it would also make the archive slightly smaller.

    On the other hand, using filenames instead of IDs would be simpler from the point of view of the rest of the application. Unfortunately, keeping the list of filenames in RAM is not acceptable in his use case. So if filenames are needed, they need to be stored in a format that makes it fast to find the correct file without loading the whole file list into RAM.

    I guess that keeping the file list in RAM isn't a problem for you at all, but if you are interested in a file format with the features described above, let me know. Nothing has been developed yet, but since the format can be application specific, it's not too bad if there end up being a few incompatible versions of a similar format as different people adapt it to different use cases. So it doesn't need to be developed with extensibility in mind. This kind of file format might already exist, but I don't know of any myself.

     
  • Jonathan Morton - 2011-06-29

    That's great, although a quick read of the API didn't reveal how to use this feature.  I did find dict_write() in lz_decoder.h, but this seems to be an internal function.  Do you have an example handy?

    I'd already anticipated the need to make a custom container format - I thought I'd base it on the high-level structure of Zip archives (i.e. index at the end) but simplified in the details.  Keeping the file lists in RAM is probably not a good idea because there are over 100,000 files in a typical installation - yes, mechanical seek time is a significant factor - but I can keep a hash table to speed up figuring out which files are in which archives, and I can keep the file indexes from the past few archives in RAM.  The effect is to provide a virtual filesystem which doesn't rely on Windows being efficient (which it isn't - it can't even keep the existing files defragmented during installation).
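
    For concreteness, the rough on-disk shape I have in mind is something like this (all names and field widths are provisional):

    // Types are from <stdint.h>.  One index entry per file; the entry
    // array lives at the end of the archive, like Zip's central directory.
    struct entry {
        uint64_t name_hash;   // hash of the path, for fast lookup
        uint64_t offset;      // start of the compressed data
        uint64_t comp_size;
        uint64_t uncomp_size;
        uint32_t prefix_id;   // index of the preset-dictionary file,
                              // or UINT32_MAX for none
    };

    struct footer {            // fixed-size trailer, read first
        uint64_t index_offset; // where the entry array begins
        uint32_t entry_count;
        uint32_t magic;        // format identifier / version
    };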

    Using zlib as-is is not sufficient either - the dictionary size (32KB max) is too small to capture inter-file redundancy.  Typical geometry files are about 2MB or so, and I need to fit at least one full file in the dictionary.  However, extending Deflate to a larger dictionary size is pretty near trivial, provided an efficient match finder is available for the encoder.

     
  • Lasse Collin - 2011-06-29

    Below is an example. It sets the dictionary size explicitly instead of relying on presets, because you want to use the same dictionary size for decompression. In theory the dictionary sizes of the presets could change, even though that is very unlikely in practice.

    lzma_options_lzma opt; // from lzma/lzma.h
    lzma_lzma_preset(&opt, LZMA_PRESET_DEFAULT);
    opt.dict_size = 8 << 20; // 8 MiB; must match at decompression time
    opt.preset_dict = preset_dict_buffer;
    opt.preset_dict_size = preset_dict_size;

    lzma_filter filters[2]; // from lzma/filter.h
    filters[0].id = LZMA_FILTER_LZMA2;
    filters[0].options = &opt;
    filters[1].id = LZMA_VLI_UNKNOWN; // terminates the filter array

    uint8_t *in = …;
    size_t in_size = …;
    uint8_t *out = …;
    size_t out_used = 0;
    size_t out_size = …;
    lzma_ret ret = lzma_raw_buffer_encode(filters, NULL, in, in_size, out, &out_used, out_size);
    if (ret != LZMA_OK) { handle_error(); }

    Decompression is similar. There is no need to use a preset here because the LZMA2 decoder needs only the dictionary size and the preset dictionary.

    lzma_options_lzma opt;
    opt.dict_size = 8 << 20; // must match the size used when encoding
    opt.preset_dict = preset_dict_buffer;
    opt.preset_dict_size = preset_dict_size;

    lzma_filter filters[2];
    filters[0].id = LZMA_FILTER_LZMA2;
    filters[0].options = &opt;
    filters[1].id = LZMA_VLI_UNKNOWN;

    uint8_t *in = …;
    size_t in_used = 0;
    size_t in_size = …;
    uint8_t *out = …;
    size_t out_used = 0;
    size_t out_size = …;
    lzma_ret ret = lzma_raw_buffer_decode(filters, NULL, in, &in_used, in_size, out, &out_used, out_size);
    if (ret != LZMA_OK) { handle_error(); }

    The decompressed size will be in out_used. If you want multi-call encoding, you need to use lzma_raw_encoder + lzma_code + lzma_end. Decoding needs lzma_raw_decoder + lzma_code + lzma_end.
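
    A rough multi-call encoding sketch (same filters array as above; error handling abbreviated):

    lzma_stream strm = LZMA_STREAM_INIT; // from lzma/base.h
    if (lzma_raw_encoder(&strm, filters) != LZMA_OK) { handle_error(); }

    strm.next_in = in;
    strm.avail_in = in_size;
    strm.next_out = out;
    strm.avail_out = out_size;

    // With all input given at once, keep calling with LZMA_FINISH until
    // LZMA_STREAM_END; refill next_out here if avail_out runs out.
    // Feeding the input in pieces works the same way with LZMA_RUN.
    lzma_ret r;
    while ((r = lzma_code(&strm, LZMA_FINISH)) == LZMA_OK) {}
    if (r != LZMA_STREAM_END) { handle_error(); }

    lzma_end(&strm);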

    The API headers are in src/liblzma/api in the XZ Utils source tree. Naturally they are also included in the binary package for Windows. The headers have short docs which hopefully help a little. Better docs would be nice, but they don't exist for now.

    As long as the filenames are fairly short, 64 bytes or so per file should be enough, so with your 100,000 files the whole file list (as a hash table or whatever) should take less than ten megabytes of RAM. This might be OK in some situations and not OK in others. I know that it isn't OK in the use case I mentioned in my previous post.

    Putting the filename index or file ID index at the end is what I thought too. It's fine because this file format doesn't need to be extracted in streamed mode.

     
  • Jonathan Morton - 2011-06-30

    Okay, that code is simple enough to understand easily.  Once I get out from under my day-job project, I'll give it a try.

     
  • Jonathan Morton - 2011-07-12

    Well, I've finally managed to build a basic test harness around this technique.  The results are very promising so far, even though I have only done very limited testing on a very slow machine.

    I started with four files, totalling about 9 MB, which I suspected were similar enough to benefit from using one as a prefix for the others.  Without prefixing, the four compress to 1.35 MB, or 15.2% of original size.

    Using the first file as a prefix, they compress to 0.8 MB, or 9.0% of original size, in total - that's the first file compressed normally, and the other three files each compressed using the (decompressed) first file as a prefix.  Using different files, or more than one file, as prefixes resulted in smaller but always positive savings.

    Clearly as more similar files are added to the archive, the savings increase further.  In the limit, the compression ratio is roughly doubled for the particular types of file I have here (which is consistent with 7-Zip's solid mode).  It might get even better if I fine-tune it using, say, the Delta filter.
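
    If the Delta filter does help, the raw filter chain would presumably look something like this (the distance should match the record stride in the geometry data; 4 here is only a guess):

    lzma_options_delta delta_opt = { 0 }; // from lzma/delta.h
    delta_opt.type = LZMA_DELTA_TYPE_BYTE;
    delta_opt.dist = 4;

    lzma_filter chain[3];
    chain[0].id = LZMA_FILTER_DELTA;
    chain[0].options = &delta_opt;
    chain[1].id = LZMA_FILTER_LZMA2;
    chain[1].options = &opt; // the lzma_options_lzma from earlier
    chain[2].id = LZMA_VLI_UNKNOWN;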

    The downside of course is that exploring the combinations of files is rather expensive, requiring N^2 individual compressions.  I will need to implement some heuristics to minimise this cost, and try to make best use of multithreading.  I will also need to include some "do it fast" options to reduce the effort spent reducing the file size, so that archives can be created efficiently during an edit-compile-test cycle.

    I also still need to get these compressed files actually onto disk - it's just a test harness for now, so everything happens in RAM.  Even so, it all decompresses cleanly.

     
  • Lasse Collin - 2011-07-13

    For fast mode, you need either to support uncompressed files in the archive or to use uncompressed LZMA2 chunks. The latter isn't so nice right now, because liblzma doesn't provide a way to create such streams. I should add such a feature.

    liblzma doesn't have a function to clone the encoder state yet (compare to zlib's deflateCopy). It would improve performance when using a preset dictionary. Initializing the encoder with a preset dictionary isn't fast because the preset dictionary needs to be run through a match finder. Cloning would be faster than running the same preset dictionary through a match finder again and again.
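
    In zlib terms, the pattern would be roughly this (names are placeholders):

    // Run the preset dictionary through the match finder once, then
    // clone the state per file instead of repeating the work.
    z_stream base = {0};
    deflateInit(&base, Z_DEFAULT_COMPRESSION);
    deflateSetDictionary(&base, dict_buf, dict_len);

    z_stream per_file;
    deflateCopy(&per_file, &base); // cheap compared to re-hashing
    // ... compress one file with per_file, then deflateEnd(&per_file);
    // base stays reusable for the next file ...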

     
  • Jonathan Morton - 2011-07-14

    Fast mode is entirely relative - I was thinking of using the equivalent of the xz -0 preset for the fastest setting, skipping the inter-file analysis (which is the *really* expensive part) for the two fastest settings, and simplifying it for the third.  Falling back to uncompressed mode for incompressible data would be done explicitly, as Zip does.

    I'm not hugely concerned about the constants in the speed of the encoder - I need to figure out the exponents first, so that a typical package with a few hundred or thousand files doesn't take all year on a decently thorough setting.

     
