multi-threading; lxz

2010-12-16
2013-05-30
  • Laszlo Ersek
    2010-12-16

    Hi,

    I wrote lbzip2. (See http://lacos.hu.) I intend to fork lbzip2-0.23 into "lxz-0.01", remove the decompression functionality, and replace the bzip2 compression with xz compression.

    (I did something like that before with lzlib (a different LZMA implementation, by Antonio Diaz Diaz); that program is now available as plzip and is maintained by him. Now I feel like doing a similar fork with liblzma.)

    I've been browsing the xz format documentation and the header files, and I think I should be able to do this. I don't intend to maintain it very actively, and if you were interested I'd pass maintenance over to you as soon as I get a working version. Obviously this is only for the interim, until you finish xz-native multi-threading; I'm not sure how far along you are with that. I do know about the "pxz" utility, but I find its architecture and disk access pattern less than optimal (though those seeks shouldn't matter too much considering how CPU-hungry liblzma is). Anyway, I don't necessarily plan to do this for the public benefit; I mainly plan to do it because xz is widespread and I myself want to be able to create conformant xz files, from pipe to pipe, with MT. I consider lbzip2 to have all the necessary fluff in place (well, at least for my needs).

    So, some technical questions:

    (A) For the trivial input splitting lbzip2 does in compression mode, I need to know the following two-valued function: for each compression preset P in 0..9, (a) what is the optimal input chunk size (that will be compressed in a single shot), and (b) what is the worst compression ratio that liblzma can produce? I need (b) because I preallocate the output block (which will be a complete xz file in itself), and I need to know its maximum size in advance, based on preset P.

    I don't intend to support the "extreme" flag. The CRC would be fixed to CRC64. The single-block multi-stream format should not be much of a problem, with Backward Size present in the stream footer; the Backward Size fields (combined with the Indices) allow one to seek backward over a multi-stream regular file.

    (B) The API pattern I intend to use for each single-shot compression, once input and output space is allocated, is:
    - initialize the stream structure (on stack),
    - lzma_easy_encoder(preset),
    - lzma_code(&strm, LZMA_FINISH),
    - assert that LZMA_STREAM_END is returned,
    - lzma_end(&strm).
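
    In C, I imagine the per-chunk routine would look roughly like this (a minimal sketch; the CRC64 check matches what I wrote above, and the bare-bones error handling is just a placeholder):

    #include <stdint.h>
    #include <lzma.h>

    /* Compress one input chunk into a preallocated output buffer that is
       large enough for the worst case; returns the number of bytes
       written, or 0 on error. */
    static size_t
    compress_chunk(uint32_t preset, const uint8_t *in, size_t in_size,
                   uint8_t *out, size_t out_size)
    {
        lzma_stream strm = LZMA_STREAM_INIT;    /* on stack */
        size_t written = 0;

        if (lzma_easy_encoder(&strm, preset, LZMA_CHECK_CRC64) != LZMA_OK)
            return 0;

        strm.next_in = in;
        strm.avail_in = in_size;
        strm.next_out = out;
        strm.avail_out = out_size;

        /* With enough output space, a single call reaches stream end. */
        if (lzma_code(&strm, LZMA_FINISH) == LZMA_STREAM_END)
            written = out_size - strm.avail_out;

        lzma_end(&strm);
        return written;
    }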

    Is there some API-level optimization that I can apply here, like not using the heaviest-weight API calls? For example, it seems I could delay the lzma_end() call until thread destruction time, and use a long-lived stream object throughout the lifetime of any given thread.

    Another thing I'm considering is LZMA_FULL_FLUSH instead of LZMA_FINISH, but I suspect that would be wrong. The blocks emitted by LZMA_FULL_FLUSH must apparently use the same stream object and, on the file level, be part of the same stream and index. However,
    - blocks produced by multiple worker threads will be interleaved in the output,
    - I would have to extract a single stream header, index, and stream footer,
    - I would possibly have to share access to the same stream object (which would entirely defeat the purpose of multi-threading),

    so I guess LZMA_FINISH is the only action I can use in my existing "framework". (Not that it torments me or anything.)

    Thus, could you please describe the preset -> (optimal input chunk size, worst compressed size) mapping asked about in (A), and could you please review the API pattern in (B) for sanity?

    Thank you very much in advance for your help.
    lacos (Laszlo Ersek)

     
  • Igor Pavlov
    2010-12-16

    Did you check p7zip?
    It supports multithreaded xz creation.

     
  • Laszlo Ersek
    2010-12-17

    Hi Igor,

    yes, thanks, I do know about p7zip. While it is a great program functionality-wise, it doesn't fit very well in a POSIX environment; according to my taste, anyway. For example, its command line doesn't follow the Utility Syntax Guidelines. Or if you stop the compression with ^C / SIGINT, the exit status is 255, not 130. I suppose there's a reason why the "plzip", "lzip", "lzma", and "xz" tools were created, and why users ask for MT in "xz", and why Lasse Collin is considering adding MT to xz.

    I mean no disrespect. p7zip is a great port of a great Windows command line program; but for me, in the functionality area it can do much more than I need, and in the interface area, a bit less. Thank you for unleashing the LZMA SDK on the world :)

    So… could you please answer my questions in the topic starter? :) Thanks!

     
  • Igor Pavlov
    2010-12-17

    You wrote:
    "I mainly plan to do it because xz is wide-spread and I myself want to be able to create conformant xz files, from pipe to pipe, with MT."

    Does p7zip do it?
    If yes, what is the problem?

    - I suppose there's a reason why the "plzip", "lzip", "lzma", and "xz" tools were created, and why users ask for MT in "xz".

    Most users don't know that p7zip already supports multithreaded xz and multithreaded bzip2.

    XZ Utils is good for two things:
    1) It's an alternative implementation to 7-Zip's code, so it can be simpler to find errors in either of the two versions.
    2) XZ Utils works mostly in the Linux world, so it can provide better integration in low-level code like the kernel or SquashFS.

    But I suppose a typical Linux user can also use p7zip.

     
  • Lasse Collin
    2010-12-17

    ipavlov's suggestion to use p7zip is the easiest option for now:

    cmd1 … | 7z a -txz -si -so /dev/null | cmd2 … > foo.xz

    Make that a script so you can use it easily in a pipe. It uses match finder threading and parallel compression of blocks inside the LZMA2 stream. It creates a single .xz Stream with a single Block.

    If you want to create your own tool using liblzma, here are answers to your questions:

    I suggest 2-4 times the dictionary size for the input buffer size. A sane value is at least equal to the dictionary size (the last chunk of the file is an exception). Look for "DictSize" on the xz man page to see the dictionary sizes associated with the compression levels: 0 uses 256 KiB and 9 uses 64 MiB, so for 0 you could use a 512-1024 KiB input buffer, and for 9, 128-256 MiB. Note that 7-Zip/p7zip use slightly different presets.
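
    Instead of hardcoding the man page table, a small sketch like the following could query the dictionary size for each preset from liblzma itself (lzma_lzma_preset() fills an options struct for a given preset and returns true on error):

    #include <inttypes.h>
    #include <stdio.h>
    #include <lzma.h>

    int main(void)
    {
        for (uint32_t p = 0; p <= 9; ++p) {
            lzma_options_lzma opt;

            if (lzma_lzma_preset(&opt, p))  /* true means failure */
                continue;
            printf("preset %" PRIu32 ": dict %" PRIu32 " bytes, "
                   "input chunk %" PRIu32 "-%" PRIu32 " bytes\n",
                   p, opt.dict_size, 2 * opt.dict_size, 4 * opt.dict_size);
        }
        return 0;
    }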

    The worst compression ratio for lzma_easy_buffer_encode() can be calculated with lzma_stream_buffer_bound(). If you use the multi-call mode, i.e. lzma_easy_encoder() and lzma_code(), lzma_stream_buffer_bound() doesn't work. If you use only LZMA2 in the .xz stream, the following should be enough (in reality you don't need even that much, but I'll keep this simple):

    out_buffer_size = uncompressed_size + (uncompressed_size >> 12) + 128
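
    Put together with lzma_stream_buffer_bound(), the single-call route could look like this minimal sketch (the malloc-per-chunk allocation is just illustrative):

    #include <stdint.h>
    #include <stdlib.h>
    #include <lzma.h>

    /* Encode one chunk as a complete .xz Stream; returns the compressed
       size, or 0 on failure.  *out is malloc'd by this function. */
    static size_t
    encode_chunk(uint32_t preset, const uint8_t *in, size_t in_size,
                 uint8_t **out)
    {
        size_t out_size = lzma_stream_buffer_bound(in_size);
        size_t out_pos = 0;

        if (out_size == 0)          /* in_size too large */
            return 0;
        *out = malloc(out_size);
        if (*out == NULL)
            return 0;
        if (lzma_easy_buffer_encode(preset, LZMA_CHECK_CRC64, NULL,
                                    in, in_size, *out, &out_pos,
                                    out_size) != LZMA_OK) {
            free(*out);
            *out = NULL;
            return 0;
        }
        return out_pos;
    }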

    Creating many .xz Streams isn't too bad, although it isn't that hard to create multiple Blocks in a single Stream either. Multiple Streams are quicker to code, though. If there are hundreds of Streams, there will be hundreds of seeks to read all the Index fields, which can be slow. Currently this matters mostly for "xz --list". Another use case is random-access reading support, but that isn't used by anything yet, I think.

    Your API usage looks good. Delaying lzma_end() is good: that way you avoid reallocating memory and thus might avoid many page faults. lzma_easy_buffer_encode() would be a simpler API to use, but it reallocates the memory on every call.

    LZMA_FULL_FLUSH isn't useful for this at all. If you want to create a single Stream, you need to create the Stream Header, Blocks, Index, and Stream Footer separately. It's not much code, but since I don't yet have any good example programs showing how to do it, learning the API could take some time.

    Have you seen this:

    http://tukaani.org/xz/lzma2-openmp.c

    It compresses chunks in parallel with only a tiny effect on the compression ratio. It uses the preset dictionary feature in liblzma and the ability to fairly easily concatenate certain types of LZMA2 streams (it cannot be done with plain LZMA without extending the stream format).
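
    The preset dictionary part boils down to something like the following sketch (raw LZMA2, no .xz container; this is an assumption-laden outline, not the demo's actual code):

    #include <stdint.h>
    #include <lzma.h>

    /* Initialize a raw LZMA2 encoder that uses the previous uncompressed
       chunk as a preset dictionary. */
    static lzma_ret
    raw_lzma2_with_dict(lzma_stream *strm, uint32_t preset,
                        const uint8_t *prev_chunk, uint32_t prev_size)
    {
        lzma_options_lzma opt;
        lzma_filter filters[2];

        if (lzma_lzma_preset(&opt, preset))
            return LZMA_OPTIONS_ERROR;
        opt.preset_dict = prev_chunk;       /* NULL for the first chunk */
        opt.preset_dict_size = prev_size;

        filters[0].id = LZMA_FILTER_LZMA2;
        filters[0].options = &opt;
        filters[1].id = LZMA_VLI_UNKNOWN;   /* terminator */

        return lzma_raw_encoder(strm, filters);
    }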

    This demo code doesn't give the very best speed, though. The problem is that different chunks take different amounts of time to compress, and there are only as many input/output buffer pairs as there are threads. It's not uncommon for some threads to idle while waiting for their turn to write the compressed data to stdout. There should be more buffers than there are threads. Memory usage could be reduced by reusing the memory of already finished chunks.

     
  • Laszlo Ersek
    2010-12-18

    ipavlov's suggestion to use p7zip is the easiest option for now:

    cmd1 … | 7z a -txz -si -so /dev/null | cmd2 … > foo.xz

    Agreed, "7z a -txz -mmt=4 -si -so /dev/null" works well as a filter.

    If you want to create your own tool using liblzma, here are answers to your questions:

    Thanks a lot for taking the time to respond!

    The worst compression ratio for lzma_easy_buffer_encode() can be calculated with lzma_stream_buffer_bound()

    I'm not sure how I could be so blind as not to notice lzma_easy_buffer_encode() - it provides everything I need and is supported by lzma_stream_buffer_bound(). I guess I was a bit clumsy navigating the header files.

    (You write in the PACKAGERS file: "it might be simpler to just let people read docs directly from the .h files for now". Would you consider adding a Makefile target that simply calls "doxygen"? The header files are littered with compatibility goo (which was probably necessary, no doubt about that), and their organization is not obvious for documentation purposes (to me anyway). I find the doxygen output much nicer to look at (links, colors, layout, and so on).)

    I think I'll stick to lzma_stream_buffer_bound() and whatever follows from its use.

    Have you seen this:

    http://tukaani.org/xz/lzma2-openmp.c

    No, not yet; though now I googled what links to it and found the other topic here (https://sourceforge.net/projects/sevenzip/forums/forum/45797/topic/3853303).

    I had never used OpenMP before; it took me a while to understand all the directives of the parallel for construct (I had to look at the OpenMP API spec). I like how expressive that region is.

    This demo code doesn't give the very best speed, though. The problem is that different chunks take different amounts of time to compress, and there are only as many input/output buffer pairs as there are threads. It's not uncommon for some threads to idle while waiting for their turn to write the compressed data to stdout. There should be more buffers than there are threads. Memory usage could be reduced by reusing the memory of already finished chunks.

    If you detach the multiplexer (the component that reorders the randomly produced compressed blocks and writes them out in order) from the workers, then you can spin the workers to the max. This is what lbzip2 does, and what lxz would inherit from it.

    Compressor design
    =================
    The splitter reads fixed size blocks from the input file, and passes them to
    the workers using a regular queue. For each block, the splitter needs a
    previously returned slot from the muxer.
    Each worker takes one input block from the queue, compresses it, and passes it
    to the muxer, using a head-only (unordered) list. Each single input block
    results in one compressed block (or, to be more exact, one full bzip2 stream
    containing one bzip2 block).
    The muxer regularly fetches the accumulated compressed blocks, reorders them in
    an internal red-black tree, and writes them to the output file. For each
    compressed block written, it returns one slot to the splitter. If the muxer is
    held up for some reason, it will stall the splitter, and that in turn will
    eventually stall the workers.
    The splitter and the muxer should be IO-bound, the workers CPU-bound.
    

    The queue depth between the splitter and the workers is four times the number of threads (an arbitrary factor). Because in compression mode each individual chunk produces a single, bounded-size compressed chunk, both the element depth and the memory footprint of the workers-to-muxer queue are known and bounded.

    lbzip2 currently uses a red-black tree for the reordering. This is not really necessary in compression mode; a randomly filled sliding window / ring buffer would do, and plzip works like that. I left the rbtree in lbzip2 because it is essential for decompression mode (see below), and I didn't want to have so many different queue structures.
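
    A sketch of such a ring buffer (the slot bookkeeping and locking here are illustrative, not lbzip2's or plzip's actual code):

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    #define WINDOW 32   /* slots; matches the splitter's slot budget */

    static struct slot {
        void *buf;      /* compressed block, NULL while empty */
        size_t size;
    } window[WINDOW];

    static uint64_t next_seq;   /* next block number to write out */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t filled = PTHREAD_COND_INITIALIZER;

    /* Worker side: block "seq" has been compressed.  The splitter's slot
       budget guarantees next_seq <= seq < next_seq + WINDOW, so the slot
       at seq % WINDOW is known to be free. */
    void muxer_put(uint64_t seq, void *buf, size_t size)
    {
        pthread_mutex_lock(&lock);
        window[seq % WINDOW].buf = buf;
        window[seq % WINDOW].size = size;
        pthread_cond_signal(&filled);   /* only the muxer ever waits */
        pthread_mutex_unlock(&lock);
    }

    /* Muxer side: wait for the next in-order block, then empty its slot
       (in the real design this is where a slot token goes back to the
       splitter). */
    struct slot muxer_get(void)
    {
        struct slot s;

        pthread_mutex_lock(&lock);
        while (window[next_seq % WINDOW].buf == NULL)
            pthread_cond_wait(&filled, &lock);
        s = window[next_seq % WINDOW];
        window[next_seq % WINDOW].buf = NULL;
        next_seq++;
        pthread_mutex_unlock(&lock);
        return s;
    }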

    During MT decompression, an unbounded amount of decompressed data can be produced from each individual input chunk (compressed block, etc.). This makes the reordering more complex (with sub-blocks), and a flat ring buffer won't do. Additionally, memory gets treated as an unbounded resource. If one wishes to add a memory throttle to this decompressor design, one has to implement the exact same restriction that you described, the one captured by the "ordered" OpenMP construct: workers must not create holes in the output buffer. Without this restriction, the memory throttle can deadlock the process.

    (E.g. the muxer holds decompressed (sub-)blocks #4 and #5, and waits for some thread to produce #3. Blocks #1 and #2 have already been flushed. The thread wishing to allocate memory for #3 is blocked on the memory throttle. The splitter is waiting for the muxer or is blocked on the memory throttle; each worker is waiting for the splitter or is blocked on the memory throttle. This is a possible situation: deadlock.)

    *Unless*, of course, you either store the decompressed size within the compressed block, or are willing to decompress twice (first into the bit bucket, just to compute the decompressed size). That way you can again have a fixed-size sliding window in the muxer, and workers have to reserve space before actually decompressing. Some workers would block on the reservation, but those that wouldn't could decompress independently.

    Another thought: in compression mode, the input blocks could be reference counted. All blocks would start with refcount=2, except the last one, which would start with refcount=1. Each input block except the last would be used (1) for compression and (2) as the dictionary for the next block; the last would only be compressed. Whichever use finishes with a given block last would release the block. (lbzip2 does something like that in MT decompression mode, when an input chunk is scanned by two threads for block delimiters in a partially overlapping fashion.)
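
    In sketch form (illustrative names, not lbzip2 code):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct inblock {
        uint8_t *data;
        size_t size;
        unsigned refcount;          /* 2, or 1 for the last block */
        pthread_mutex_t lock;
    };

    /* Called once by the compressing worker and once by the worker that
       used the block as a dictionary; the last caller frees it. */
    void inblock_release(struct inblock *b)
    {
        unsigned left;

        pthread_mutex_lock(&b->lock);
        left = --b->refcount;
        pthread_mutex_unlock(&b->lock);
        if (left == 0) {
            pthread_mutex_destroy(&b->lock);
            free(b->data);
            free(b);
        }
    }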

    Thanks again!

     
  • Laszlo Ersek
    2010-12-18

    I released lxz-0.01 and did some completely non-scientific testing.

    Input file (usr.tar) size: 2,872,135,680 bytes

    The commands were:

    $ cat usr.tar >/dev/null
    $ pv usr.tar | \time -v xz >usr.tar.xz 2>xz.log
    $ cat usr.tar >/dev/null
    $ pv usr.tar \
    | \time -v 7z a /dev/null -bd -txz -si -so -mmt >usr.tar.7z.xz 2>7z.log
    $ cat usr.tar >/dev/null
    $ pv usr.tar | \time -v lxz >usr.tar.lxz.xz 2>lxz.log
    $ cat usr.tar >/dev/null
    $ pv usr.tar | \time -v lxz -7 >usr.tar.lxz.7.xz 2>lxz.7.log
    

    The output files were verified like this:

    $ for I in *.xz; do xz -d -v -c $I | cmp - usr.tar; done
    

    The decompression wall clock time measured by xz was recorded in the table
    below.

    Here's a diagram about the machine architecture (Xeon E5520), generated by
    hwloc: http://i.imgur.com/tYniz.png

    Some version numbers affecting more than one test run:
    - liblzma: 5.0.0
    - gcc: Debian 4.3.2-1.1
    - glibc: 2.7-18lenny2
    - linux: 2.6.26-2-amd64

    Results:

    +------------------+----------------+-------------+-------------+------------+
    |                  | p7zip          | xz          | lxz         | lxz -7     |
    +------------------+----------------+-------------+-------------+------------+
    |version           | [64] 9.13 beta | 5.0.0       | 0.01        | 0.01       |
    +------------------+----------------+-------------+-------------+------------+
    |compressed size   |    728,005,120 | 739,843,716 | 776,352,280 | 737,421,076|
    +------------------+----------------+-------------+-------------+------------+
    |# of Streams      |              1 |           1 |         172 |          86|
    +------------------+----------------+-------------+-------------+------------+
    |user (s)          |        1669.66 |     1627.38 |     1826.75 |     1914.25|
    |sys (s)           |           6.68 |        2.47 |        9.48 |        9.84|
    |wall clock (m:ss) |        4:47.95 |    27:12.79 |     3:55.46 |     4:12.37|
    +------------------+----------------+-------------+-------------+------------+
    |cpu (%)           |            582 |          99 |         779 |         762|
    +------------------+----------------+-------------+-------------+------------+
    |xz decompression  |                |             |             |            |
    |wall clock (m:ss) |           1:10 |        1:19 |        1:14 |        1:16|
    +------------------+----------------+-------------+-------------+------------+
    

    For input chunks, lxz uses twice the dictionary size, and I chose CRC32 for
    checksumming because (given the former) the input size can't exceed 128 MiB.
    I went with lzma_easy_buffer_encode() and lzma_stream_buffer_bound().

    I think lxz won't be entirely useless for me, at least until xz grows multiple
    threads.

    Thank you very much for your help!