Use compression context readonly

P.Marek
2010-11-08
2013-05-30
  • P.Marek
    2010-11-08

    Hello,

    I've got another question.

    Is there some way to pass some amount of data (at least the dictionary size),
    and then switch to some kind of readonly mode, so that multiple threads can use
    the stored information to compress the new data?

    This is about a memory-constrained system that should nonetheless run
    multithreaded.
    Using a higher level gets better compression; but using one context per thread
    isn't possible due to memory constraints.

    So I'd like to pass the first batches of data through a single thread; and when
    the basic structure of the data is known, switch the context to readonly and use
    multiple threads on this context.

    From time to time I'd like to "thaw" the context, to let LZMA learn about the
    current data, and then "freeze" it again some time later.

    Is it possible to do that, or would it be much work to implement?

    Regards,

    Phil

  • Lasse Collin
    2010-11-09

    It's not possible now. I can imagine some ways to do something like that with LZMA2 but it doesn't sound practical. It could be better to have a different algorithm (could be heavily based on LZMA2). I don't have time to work on such a thing for now. It's also hard to guess how much worse the compression ratio would be. With bad luck it wouldn't be so useful in practice.

    How memory-constrained is the system? How many threads? How big a dictionary do you need? Would it be OK to use the HC4 match finder instead of BT4?

  • P.Marek
    2010-11-10

    Well, at the highest compression level I have to reckon with about 1.5GB of RAM per thread, and there could easily be two quad-cores (so 8 threads), which would mean 12GB of RAM … and if I want to be able to handle several connections in parallel, I'd need a separate context per thread per connection, so that doesn't really scale for the next few years.

    Of course, going with a lower compression and about 150MB RAM per context does work fine for some time …

    What would HC4 instead of BT4 mean? Would that influence the stored dictionary?
    One of the fine things about LZMA is that it stores such a long history of previous data, so similar blocks can be transferred efficiently. That's the thing I'd like to keep - but then I'd need contexts as big as possible again …

  • P.Marek
    2010-11-10

    In other words, I'd like a way to compress a block using a context with history *without* touching the context.

    Now, to me this sounds like a flag (or maybe a new function) that simply doesn't update the context after compressing ;-)

    In case that's easier, I'd be happy to leave the input buffer's size unused in the dictionary (so that the compress function can think it can copy the data in there) - that would cost me about 32kB, and gain a history of 1.5GB ;-)

  • Lasse Collin
    2010-11-11

    That doesn't sound very memory-constrained. I was thinking that you have some embedded system with a dual-core CPU.

    The preset 9 uses a 64 MiB dictionary and needs 674 MiB of memory to encode. liblzma's maximum dictionary size is currently 1.5 GiB. With the BT4 match finder, a 1.5 GiB dictionary needs 16.3 GiB of memory (10.3 GiB with HC4, but it compresses worse). But a 1.5 GiB dictionary would be insane for most uses even if there was enough RAM. Even the 64 MiB dictionary used by the preset 9 often doesn't help much compared to slightly smaller dictionaries.

    Naturally it depends on the exact use case. E.g. rzip and rzip-based lrzip show that a huge dictionary can be good for speed and compression ratio at the same time when compressing big files. That's why there may be something rzip-like in liblzma in the future.

    There are many more details than just adding a flag to freeze the state of the encoder. You still need thread-specific state (small) for the range encoder and other data that is needed when creating the compressed stream. That is, you want to freeze only the match finder state, which is what eats so much RAM. The data being compressed is first copied to the dictionary and then processed from there. Since the match finder state is frozen, you cannot write the new input data to the dictionary, and thus you need to change the code in several places to accommodate the frozen match finder state.

    You might also want to make the match finder take into account the new data, which adds more complexity but can improve compression ratio. LZ77-based algorithms refer to the data in the dictionary as a distance from the end of the dictionary. LZMA (and many other LZ77-based algorithms) encode short distances with fewer bits than large distances. So you usually want to use the shortest distance when multiple identical matches are available in the dictionary.
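
    To illustrate that last point, here is a toy sketch (the match list and its
    types are made up for illustration; this is not liblzma's internal code):

    #include <stddef.h>
    #include <stdint.h>

    /* Among candidate matches, prefer a longer match; on equal length,
     * prefer the smaller distance, since LZMA encodes short distances
     * with fewer bits. */
    typedef struct { uint32_t len; uint32_t dist; } match;

    static match pick_match(const match *cand, size_t n)
    {
        match best = cand[0];
        for (size_t i = 1; i < n; ++i)
            if (cand[i].len > best.len ||
                    (cand[i].len == best.len && cand[i].dist < best.dist))
                best = cand[i];
        return best;
    }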

    I understand that something like this could be useful, but unfortunately I won't have time to work on it in the foreseeable future.

    Since you would be compressing only 32 KiB at a time, have you tried whether a preset dictionary of 256 KiB to maybe 8 MiB would give good results? That should keep the per-thread memory usage sane. Currently a preset dictionary can be used only with raw LZMA2 streams from C code, but support for it will likely be added to the .xz format and the xz tool in the future.
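
    For reference, setting up a raw LZMA2 encoder with a preset dictionary looks
    roughly like this (a sketch; dict_buf and dict_buf_len are placeholders for
    your recently-seen data, and error checks are omitted):

    #include <lzma.h>

    lzma_options_lzma opt;
    lzma_lzma_preset(&opt, LZMA_PRESET_DEFAULT);
    /* The buffer must stay valid for the lifetime of the encoder. */
    opt.preset_dict = dict_buf;
    opt.preset_dict_size = dict_buf_len;

    lzma_filter filters[] = {
        { LZMA_FILTER_LZMA2, &opt },
        { LZMA_VLI_UNKNOWN, NULL },
    };
    lzma_stream strm = LZMA_STREAM_INIT;
    lzma_raw_encoder(&strm, filters);   /* raw stream: no .xz headers */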

    liblzma doesn't currently have a function to duplicate the encoder state (zlib has deflateCopy()), which can be a problem from a performance point of view when using a preset dictionary. I will add such a function in the future.
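
    The zlib pattern being referred to looks roughly like this (a sketch;
    dict_buf and dict_buf_len are placeholders): process the preset dictionary
    once, then clone the state per stream instead of re-feeding the dictionary
    every time.

    #include <zlib.h>

    z_stream base = {0}, per_conn = {0};
    deflateInit(&base, 6);                               /* level 6 */
    deflateSetDictionary(&base, dict_buf, dict_buf_len); /* pay this cost once */
    deflateCopy(&per_conn, &base);                       /* cheap clone per stream */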

  • P.Marek
    2010-11-12

    Sorry, I wasn't clear enough.

    The normal block size that I get for streaming will be 4kB to 32kB - but the data stream itself might run indefinitely.
    One of these streams has seen 12.1GB since Tuesday - so a bigger history should definitely help for delta encoding and the like.

    In a test I had LZMA use ~150MB - and repeatedly streaming data from a 64MB block (random length, random position) got a very good compression ratio, so I'd like to get the dictionary as big as possible.

  • Lasse Collin
    2010-11-15

    Do you need to use LZMA_SYNC_FLUSH after each block of a few kilobytes? If yes, have you checked whether using the HC4 match finder instead of BT4 gives you better compression? The BT match finders generally give poor results if LZMA_SYNC_FLUSH is used very often (1 KiB is very often; at 32 KiB it is unclear).
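
    For reference, flushing after each block looks something like this (a
    sketch; block, block_len, and out_buf are placeholders, and error handling
    is omitted):

    strm.next_in = block;
    strm.avail_in = block_len;
    lzma_ret ret;
    do {
        strm.next_out = out_buf;
        strm.avail_out = sizeof(out_buf);
        ret = lzma_code(&strm, LZMA_SYNC_FLUSH);
        /* send out_buf[0 .. sizeof(out_buf) - strm.avail_out) to the peer */
    } while (ret == LZMA_OK);
    /* LZMA_STREAM_END here means the flush completed; everything passed
       in so far can now be decoded on the receiving side. */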

    To know how much a bigger dictionary helps, you need to test it. Take a few gigabytes of real-world data and compress it with different settings. If you use LZMA_SYNC_FLUSH in the real situation, you need to use it in testing too to get useful results.

    About freezing the encoder state: If the LZMA2 stream format is kept unmodified, the results cannot be great for long continuous streams. It might be OK for compressing small blocks and resetting the stream state after each block (so a long stream would be made from many small blocks). If the LZMA2 stream format is modified, then it might work OK for long streams without resetting between blocks. It would still have some effect on compression ratio, and I cannot predict or easily test how big the effect would be.

  • P.Marek
    2010-11-15

    Yes, I'm using SYNC_FLUSH. No, I'm not really playing around with the
    individual parameters; currently, for testing, I'm just using easy_encoder with
    a level of 1 or 2.

    But:
    I've run 128MB of (constant) data through a bit more than 1000 (randomly)
    chosen parameter sets (consisting of blocksize, mode, dict_size, lc, lp, pb,
    nice_len, depth, and mf), and I'm looking at various subsets.

    For 1KB blocks HC3 and HC4 seem to do a bit better than BT* (CPU and
    compression).
    For 8KB the compression is nearly the same.

    I'm aware that I'd have to thaw the encoder from time to time, to let it get
    up-to-date for the new data.
    Alternatively, if the live/frozen distinction is just a flag given to
    lzma_code(), I'd think about doing a few blocks with a readonly context,
    a few more doing live compression, and so on.
    (Maybe depending on current throughput, or something similar).

    I could guarantee you that the encoder state won't be touched during a readonly
    compression, if that helps; I'm already using RW-locks, so it's no big deal.

    Increasing the dictionary size gives me these results; the ratio is given
    relative to the original size (so 0.3 means 70% saved):
      dict_size     ratio   variance    data points
         4096       0.324   3.76e-3     204
        16384       0.307   4.79e-3     187
       262244       0.292   4.56e-3     201

    So a big dictionary brings me about 10% less data after compression.
    That's as expected, since the data should be somewhat correlated - and I'd
    like to exploit that.

    I'm planning to do more tests later; do you have any particular ideas about
    what I should test?

    Regards,

    Phil

  • Lasse Collin
    2010-11-15

    dict_size = 262144 (I suppose you meant that) is 256 KiB, which is still a quite small dictionary. Compare it to gzip, whose maximum is 32 KiB. The xz preset 0 uses a 256 KiB dictionary (xz -0 on the command line). A 256 KiB dictionary needs 3-4 MiB of encoder memory and less than 1 MiB to decompress.

    Increasing the dictionary size has a big effect at first, but as the dictionary grows, each increase typically helps less and less. There are exceptions to this, of course, but seeing a big improvement when going from 4 KiB to 256 KiB is expected.

    To test the dictionary size, take the default preset (LZMA_PRESET_DEFAULT i.e. 6) as the starting point:

    lzma_options_lzma opt;
    lzma_lzma_preset(&opt, LZMA_PRESET_DEFAULT);

    Test with dictionary sizes from 32 KiB to 128 MiB using powers of two (1 << n, 15 <= n <= 27) as opt.dict_size. Repeat with opt.mf = LZMA_MF_HC4 if you use small blocks with LZMA_SYNC_FLUSH. Use lzma_stream_encoder() to initialize the encoder:

    lzma_filter filters[] = { { LZMA_FILTER_LZMA2, &opt }, { LZMA_VLI_UNKNOWN, NULL } };
    lzma_stream strm = LZMA_STREAM_INIT;
    lzma_stream_encoder(&strm, filters, LZMA_CHECK_CRC32);
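
    Putting the whole sweep together might look like this (a sketch; feeding
    the test data through lzma_code() is left out, and error checks are
    omitted):

    #include <lzma.h>
    #include <stdio.h>

    static void dict_size_sweep(void)
    {
        for (unsigned n = 15; n <= 27; ++n) {
            lzma_options_lzma opt;
            lzma_lzma_preset(&opt, LZMA_PRESET_DEFAULT);
            opt.dict_size = 1U << n;        /* 32 KiB ... 128 MiB */
            /* opt.mf = LZMA_MF_HC4; */     /* repeat the runs with HC4 */

            lzma_filter filters[] = { { LZMA_FILTER_LZMA2, &opt },
                                      { LZMA_VLI_UNKNOWN, NULL } };
            lzma_stream strm = LZMA_STREAM_INIT;
            lzma_stream_encoder(&strm, filters, LZMA_CHECK_CRC32);
            printf("dict_size 2^%u: encoder needs %llu MiB\n", n,
                   (unsigned long long)(lzma_raw_encoder_memusage(filters) >> 20));

            /* ... run the test data through lzma_code() here and record
               the compressed size ... */

            lzma_end(&strm);
        }
    }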

    If easily possible, use a test file that is 500 megabytes or so. It will give more reliable results with the biggest dictionary sizes (64 MiB and 128 MiB).

    The idea is to see how much the dictionary size affects the compression. The speed won't be the best with these settings, but let's not worry about that for now. I want you to find out how much it helps to increase the dictionary size. Then we can see how much memory a good enough dictionary size needs, and whether that is low enough to let each thread use its own encoder state.

  • P.Marek
    2010-11-17

    Dear Lasse,

    first of all, thank you very much for your patience and help.
    I really appreciate that.

    I've got a few results for you.

    I've taken 64MB of a /usr tarball and put random blocks of it through LZMA
    compression and decompression in a loop until 256MB had been transferred.
    dict_size ranged from 16kB to 128MB.

    (Sorry that it's not more data - but I couldn't test with dictionaries over
    128MB anyway, as the memory got exhausted.)

    Average ratio:
    dict_size  |  bt2  |  bt3  |  bt4  |  hc3  |  hc4  | Total
    -----------+-------+-------+-------+-------+-------+------
    16384      | 0.336 | 0.316 | 0.318 | 0.315 | 0.300 | 0.316
    1048576    | 0.264 | 0.291 | 0.266 | 0.271 | 0.276 | 0.272
    16777216   | 0.168 | 0.173 | 0.173 | 0.159 | 0.164 | 0.167
    67108864   | 0.103 | 0.088 | 0.087 | 0.107 | 0.098 | 0.097
    134217728  | 0.090 | 0.071 | 0.076 | 0.098 | 0.071 | 0.081
    -----------+-------+-------+-------+-------+-------+------
    Total      | 0.189 | 0.174 | 0.190 | 0.192 | 0.179 | 0.185

    CPU time:
    dict_size  | bt2 | bt3 | bt4 | hc3 | hc4 | Total
    -----------+-----+-----+-----+-----+-----+------
    16384      | 146 | 127 | 126 |  98 |  98 |  118
    1048576    | 179 | 200 | 167 | 189 | 156 |  178
    16777216   | 242 | 223 | 189 | 208 | 170 |  204
    67108864   | 242 | 236 | 225 | 150 | 130 |  197
    134217728  | 222 | 221 | 221 |  95 |  85 |  171
    -----------+-----+-----+-----+-----+-----+------
    Total      | 208 | 207 | 183 | 152 | 129 |  175

    My result is that a bigger dictionary helps very much, up to the amount of
    "hot" data transferred. CPU usage is reduced by a bigger dictionary, too.

    Now for the bad news …
    I measured these memory requirements; they are mostly independent of the
    other parameters:
    hc3: dict_size *  7.0 + 12.0
    hc4: dict_size *  7.5 +  6.5
    bt2: dict_size * 10.5 +  1.1
    bt3: dict_size * 11.0 + 12.2
    bt4: dict_size * 11.5 +  6.5

    So, if I can use 3.5GB memory, I can make the dictionary about 500MB - if
    I limit myself to a single thread.

    I see real value in being able to make the context readonly, so that multiple
    threads can use it independently. Extra points for making the context usable
    multithreaded, so that one thread automatically gets to update it while the
    others use it readonly - then I wouldn't have to take care of the locking
    myself.

    Thank you very much!

    Regards,

    Phil

  • Lasse Collin
    2010-11-17

    Thanks. It seems that a bigger dictionary helps you significantly, assuming this test case matches a real-world situation.

    My current situation is that I cannot even plan the frozen-state support for now. I have too many other things on the to-do list already. Freezing the state is not as quick to implement as it may sound to you. But I will keep this kind of use case in mind.

    What exactly is your application doing? It sounds like a file server from which clients often request the same data. Anyway, have you taken into account that with a frozen encoder state and multiple threads & connections, all receivers would need to have the same dictionary buffer that was used for encoding? So if you use a 512 MiB dictionary, every receiver would need a 512 MiB dictionary buffer in memory, and it would have to match the state of the encoder.

  • P.Marek
    2011-01-14

    Another idea: Is it possible to cut the index size? I'd be happy if matches were found only if they are aligned on multiples of 512 bytes; I think that this would reduce the trees enough that I could use a much bigger dictionary.

    How about a mode that just looks for similar blocks on 512-byte borders, and if it finds one, simply does a huffman(buffer XOR old-block)? That should be efficient enough, and might cover most of my needs.

  • Lasse Collin
    2011-01-19

    Yes, it is technically possible.

    Note that I haven't been able to work on XZ Utils recently. I haven't even added the ability to change the mode parameter yet.

  • P.Marek
    2012-07-12

    Another wish: please provide a callback (similar to the allocate/free ones) that gets called to "unlock" a context when the necessary things have been done, to allow overlapping processing.

    I imagine this workflow (a hypothetical hook is sketched below):
    * a thread "locks" a context (within some user-defined memory)
    * it calls lzma_(compress|uncompress), which
        - copies the given data into the dictionary,
        - adjusts some values,
        - fixes indices,
        - then invokes the callback,
        - and continues to compress the given data with the available dictionary
    * after the unlock has been issued, another thread can start to pass
      data to lzma
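
    Such a hook might look like this (purely hypothetical - this API does not
    exist in liblzma; the names are made up, following the style of
    lzma_allocator):

    /* Hypothetical: called by the encoder as soon as it no longer needs
     * exclusive access to the shared context, so that another thread may
     * start feeding its own block. */
    typedef struct {
        void (*unlock)(void *opaque);
        void *opaque;
    } lzma_unlock_hook;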

    Now the granularity/parallelism might not be that great currently …
    but perhaps LZMA could allocate more memory beforehand, and do some RCU to
    allow concurrent processing?

    I have to admit that I don't know enough internals to say whether that's easy,
    hard, or impossible …

    How much work would that be, i.e. could sponsoring help?

    Thank you very much!

    Regards,

    Phil

  • Lasse Collin
    2012-07-22

    There are a few quite major problems:

    - My ability to work on XZ Utils is still limited. When I work on XZ Utils, I have several things that have a higher priority. I'm sorry.

    - I don't completely understand what kind of feature you are requesting.

    - I don't understand the underlying problem as a whole that you are trying to solve. Thus, even if I fully understood what you are requesting, I would have no idea if it actually does what you are hoping it to do.

    :-)

    I have got the impression that you have a server that serves multiple clients. You want to compress the data streams between the server and the clients. The problem is keeping the memory usage low enough on the server when many clients are connected at the same time. The memory usage on the clients isn't a problem.

    Since you are sending (somewhat) different data to each client, you cannot use a single encoder state and send the same compressed output to every client. You are suggesting that the encoder should be able to freeze the state for a while and share this frozen read-only state between threads, so that it could be used to encode data to be sent to multiple clients (different data to each client). Now and then the state would be unfrozen and updated to contain recently seen data.

    I don't know how much I got it wrong above; please correct me if needed. I continue with the assumption that the above is correct.

    Even if something like LZMA2 with freezing support was implemented, the step where the new data is added to the dictionary on the server sounds a bit suspicious (I already tried to point out a potential problem in message 11, paragraph 3).

    To simplify, the encoder state can be split into dictionary (dict_size bytes or slightly more), match finder (a few times dict_size), and other data (not significant in this discussion). The decoder needs state for the dictionary (dict_size bytes) and other data (not significant here).

    You need to keep the states of the decoders on the clients in sync with the encoder on the server. If you add several megabytes of data to the dictionary on the server, you need to get the same data added to the dictionaries of the clients too. Sending megabytes of data to every client just to update their dictionary buffers might make the idea of a frozen encoder useless in your situation, but it depends on the details, I guess.

    Encoder freezing support might really be what you want, but since I don't know what the underlying problem as a whole is, it's possible that something very different could be much better.

  • P.Marek
    2012-07-22

    No, the stream that is to be compressed is for a single client.

    But: I'd like to take advantage of the many CPUs that are available, by doing some kind of parallel processing.

    There are a few options for this:

    ** Use multiple LZMA contexts.

    + scales to many CPUs
    - much more memory needed
    - bad compression ratio, as each context will only see part of the data and might not know that the same data has just been processed.

    ** have a single context and locking

    + works now
    + best compression
    + memory usage only for 1 context
    - only a single CPU can work at a time

    And that's where the idea that I tried to explain above comes in:

    ** Use a single context, but some more memory
    I know that each incoming block is at most L bytes long, and that there are C CPUs (threads).

    So it might be possible to allocate a dictionary that's L*C bytes longer; then each thread can add its data without worrying that it destroys the dictionary of the (still active) threads that started earlier.

    Now, if there were a way to make the indices usable in parallel too, this should work.

    + scales to many CPUs
    +- uses a bit more memory than normal
    + has the complete dictionary size for finding matches

    If the indices' granularity were made coarser, e.g. to 64 or 512 bytes, their size might shrink a fair bit, too.

    Ad 1: of course the indices would have to get some more memory, too
    Ad 2: perhaps by making them kind of constant - instead of changing an entry, copy it and all its parents to build a new tree? (sketched below)
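
    A toy illustration of that "copy the entry and its parents" idea (path
    copying over a plain binary search tree - not liblzma's match finder
    structures):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct node { uint32_t key; struct node *l, *r; } node;

    /* Returns the root of a new tree version; all nodes off the search
     * path are shared with the old version, so readers can keep using
     * the old root concurrently. */
    static node *insert(const node *t, uint32_t key)
    {
        node *n = malloc(sizeof *n);
        if (t == NULL) {
            n->key = key;
            n->l = n->r = NULL;
        } else {
            *n = *t;                      /* copy this node only */
            if (key < t->key)
                n->l = insert(t->l, key);
            else if (key > t->key)
                n->r = insert(t->r, key);
        }
        return n;
    }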

    I hope that clarifies a bit what I want to achieve.

  • Lasse Collin
    2012-07-27

    In short, you seem to want threaded compression with support for frequent flushing. However, you don't need it to happen as a single continuous stream; you are fine with having parallel streams that depend on each other. Making such an algorithm sounds possible but I'm not willing to try it in the foreseeable future, sorry.

    In message 12 you mentioned that it might be enough to look for matches aligned to multiples of 512 bytes. Some random thoughts about that:

    There's an old idea of a two-level match finder for LZMA: one that works like the current ones, and another that cares about e.g. aligned 512-byte blocks. That would reduce memory usage and hopefully not hurt the compression ratio too much.

    Maybe a much simpler thing would be enough for you: Have a simple long-range LZ77 algorithm that works on aligned 512-byte blocks without any entropy coding. That should be fast and perhaps a single thread would be enough. Optionally the output of that could be compressed with e.g. zlib or liblzma (with a small dictionary).
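
    Such a long-range step might look roughly like this (a sketch; the hash
    function, table size, and output format are arbitrary choices, and the
    actual output is left as comments):

    #include <stdint.h>
    #include <string.h>

    #define BLK 512
    #define TAB (1U << 20)              /* 1 Mi slots, arbitrary */

    static uint64_t seen[TAB];          /* block offset + 1; 0 = empty */

    static uint32_t blk_hash(const uint8_t *p)
    {
        uint32_t h = 2166136261U;       /* FNV-1a over the block */
        for (size_t i = 0; i < BLK; ++i)
            h = (h ^ p[i]) * 16777619U;
        return h & (TAB - 1);
    }

    static void long_range(const uint8_t *data, uint64_t len)
    {
        for (uint64_t pos = 0; pos + BLK <= len; pos += BLK) {
            uint32_t h = blk_hash(data + pos);
            uint64_t prev = seen[h];
            if (prev != 0 && memcmp(data + (prev - 1), data + pos, BLK) == 0) {
                /* emit a match: distance = pos - (prev - 1) */
            } else {
                /* emit the 512 literal bytes verbatim */
            }
            seen[h] = pos + 1;
        }
    }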

    rzip has a long-range algorithm that is fancier than the above example but it's still fast. The output of the long-range step is compressed with bzip2. lrzip continued from rzip and supports multiple algorithms to compress the output of the long-range step. You could try how lrzip compresses your data (with different backend compressors and also without any backend compression).