
#1626 Error with .bz2 created with lbzip2

open
lbzip2 (1)
7
2018-04-15
2016-02-09
Neo IT
No

Bug with .bz2 archives created by lbzip2 (a Linux command-line archiver, a multi-threaded version of the standard bzip2 archiver).
Compressing with lbzip2 2.3-1 64-bit on Ubuntu 14.04.3 LTS 64-bit; decompressing with 7-Zip 9.20 64-bit on Windows 7 64-bit and 7-Zip 15.14 32-bit on Windows 7 32-bit.
Compressing and decompressing a small .txt file works fine.
When I compress a big file (for example mirror.yandex.ru/altlinux/old/GIVC/livecd.iso) and then try to decompress it with 7-Zip (different versions), it says that the archive is corrupted. Different files and different versions of 7-Zip may show the error at different points (in percent) of the decompression.
At the same time, decompressing this file on Linux with tar or lbunzip2 works fine.
The same file compressed to .bz2 on the same system with the standard bzip2 archiver decompresses with 7-Zip without errors.

Discussion

  • Igor Pavlov

    Igor Pavlov - 2016-02-10

    Please provide an example of such a bz2 file.
    You can split the "bad" file into parts and compress each part, so the "bad" bz2 file will probably be smaller.

     
  • William Nichols

    William Nichols - 2017-10-09

    I experienced the same issue.
    128MB tar.bz2 created on RHEL7 kernel 3.10.0-693.2.2.el7.x86_64 with
    tar x86_64 epoch 2, version 1.26, release 32.e17
    lbzip2 x86_64 version 2.5 release 1.e17

    attempted to test/extract with 7zip 64bit 16.04 on Windows 7 64bit and 7zip 64bit 16.02 on RHEL 7
    tar.bz2 created with standard bz2 tests/extracts OK
    tar.bz2 created with lbzip2 test/extract operation fails quickly with 'Data error : <filename>.tar'
    tar.bz2 created with lbzip2 test/extract OK with tar and lbunzip2 on RHEL 7
    tar.bz2 created with lbzip2 test/extract OK with WinRar 64bit on Windows 7 64

    lbzip2 file
    https://www.dropbox.com/s/gm77vafmhxk93ph/lbzip2-R4.x267.000.0003.tar.bz2?dl=0
    md5 - 85fc812e5b99d44ba50c93d532f4d278
    sha256 - 802c85b9230968cbf71701fef8513dff6aee875aa300993e2d6fbb9df3b962f2

    standard bz2 file
    https://www.dropbox.com/s/mx9cufpilnvs2k6/nonlbzip2-R4.x267.000.0003.tar.bz2?dl=0
    md5 - cc40f6ba61d4c796b2f88cbb30e71dd2
    sha256 - 2a41b8d7da89779424ace383ffa46fe29e8c85d0593f32982adb934cd7fa563c

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-10

    OK.
    I've sent a question to the lbzip2 developer.

    Technical description of lbzip2/7-Zip compatibility problems:

    lbzip2 - Problem 1 - The number of selectors

    The bzip2 decoder can use up to 18001 selectors (900000/50 + 1).
    But the "number of selectors" is stored in a 15-bit field (32767 is the max value).

    The number of selectors:

    (numSelectors <= 18001) - must work with any decoder
    (numSelectors == 18002) - must work with the bzip2 1.0.6 decoder and all derived decoders
    (numSelectors  > 18002) - 
       array overflow is possible with the bzip2 1.0.6 decoder, but the compiled code still works without problems.
       doesn't work with some decoders derived from the original bzip2 code (some Apache Java versions and others).
       doesn't work with 7-Zip
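
The three tiers listed above can be written down as a small classifier. This is only an illustrative sketch: the enum, macro, and function names are made up for this example, and the constants simply restate the numbers given in this comment (groups of 50 symbols, a 15-bit field). It is not code from bzip2, lbzip2, or 7-Zip.

```c
#include <assert.h>

/* Constants restated from the comment above; all names are illustrative. */
#define G_SIZE              50       /* symbols covered by one selector */
#define MAX_BLOCK_SYMBOLS   900000
#define SPEC_MAX_SELECTORS  (MAX_BLOCK_SYMBOLS / G_SIZE + 1)  /* 18001 */
#define BZIP2_ARRAY_SIZE    (2 + MAX_BLOCK_SYMBOLS / G_SIZE)  /* 18002 */
#define FIELD_MAX_SELECTORS 32767    /* largest value of a 15-bit field */

typedef enum {
    SEL_PORTABLE,    /* <= 18001: should work with any decoder */
    SEL_BZIP2_ONLY,  /* == 18002: bzip2 1.0.6 and derived decoders */
    SEL_RISKY        /*  > 18002: may overflow arrays or be rejected */
} SelectorTier;

/* Map a declared selector count to one of the compatibility tiers. */
static SelectorTier classify_selectors(unsigned numSelectors)
{
    assert(numSelectors <= FIELD_MAX_SELECTORS);
    if (numSelectors <= SPEC_MAX_SELECTORS)
        return SEL_PORTABLE;
    if (numSelectors <= BZIP2_ARRAY_SIZE)
        return SEL_BZIP2_ONLY;
    return SEL_RISKY;
}
```

With these definitions, a block that declares 18001 + 7 selectors falls into the last tier.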
    

    bzip2 1.0.6

    #define BZ_MAX_SELECTORS (2 + (900000 / BZ_G_SIZE))   /* 18002, since BZ_G_SIZE is 50 */
    struct
    {
      /* ... */
      UChar selector   [BZ_MAX_SELECTORS];
      UChar selectorMtf[BZ_MAX_SELECTORS];
      /* other arrays follow */
    }
    

    The bzip2 decoder doesn't check the exact number of selectors.
    So the decoder can overflow the selector and selectorMtf arrays. But there are other arrays after the selectorMtf array in the structure, so the overflow data is written to those arrays, and the bzip2 C decoder still works correctly.

    Some Java bzip2 implementations allocate only 18002 items in the selector arrays, so they can overflow.
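
One way a decoder could tolerate surplus selectors without writing past its arrays is to consume every declared selector but store only those that fit. The sketch below assumes the surplus selectors are dummies that no group of symbols ever refers to; `store_selectors` and `MAX_SELECTORS` are hypothetical names, and the input is a plain array rather than a bit stream to keep the example short. It does not show what 7-Zip or bzip2 actually do.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SELECTORS 18002   /* same value as BZ_MAX_SELECTORS above */

/* Hypothetical helper: copy `declared` selectors from `src` into a
   fixed-size array, silently discarding any that do not fit.
   A real decoder would still have to read the discarded selectors
   from the bit stream so that the following data stays aligned.
   Returns the number of selectors actually stored. */
static size_t store_selectors(unsigned char dst[MAX_SELECTORS],
                              const unsigned char *src, size_t declared)
{
    size_t kept, i;
    assert(dst != NULL && src != NULL);
    kept = declared < MAX_SELECTORS ? declared : MAX_SELECTORS;
    for (i = 0; i < kept; i++)
        dst[i] = src[i];
    return kept;
}
```

The alternative design is to reject such a block outright, which is stricter but refuses files that other decoders accept.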

    The lbzip2 decoder supports up to 32767 selectors.
    The lbzip2 encoder can write (18001 + 7) selectors.
    It can add up to 7 dummy selectors to make the block size a multiple of 8 bits. The additional dummy selectors can help with speed.
    So I suggested that lbzip2 reduce the block size by 7 selectors (18002 - 7). That is a reduction of 350 bytes. Then there will be no more than 18002 selectors, including the additional dummy selectors.
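
The arithmetic behind that suggestion can be checked with a few lines of C. This is a sketch under stated assumptions: one selector covers 50 symbols, the encoder may append up to 7 dummy selectors, and the target limit is 18002 selectors. The function and macro names are invented for this example, and the "symbols" counted here are assumed to be the symbols that selectors are assigned to, including the end-of-block symbol.

```c
#include <assert.h>

#define G_SIZE        50      /* symbols covered by one selector */
#define MAX_DUMMY     7       /* dummy selectors the encoder may append */
#define DECODER_LIMIT 18002   /* selectors that bzip2 1.0.6 arrays can hold */

/* Number of selectors needed to cover nSymbols (rounded up). */
static unsigned selectors_for(unsigned nSymbols)
{
    assert(nSymbols > 0);
    return (nSymbols + G_SIZE - 1) / G_SIZE;
}

/* Largest symbol count for which real plus dummy selectors
   still stay within DECODER_LIMIT: (18002 - 7) * 50. */
static unsigned max_symbols_with_padding(void)
{
    return (DECODER_LIMIT - MAX_DUMMY) * G_SIZE;
}
```

Under these assumptions the 7 reserved selectors cost 7 * 50 = 350 symbols per block, which matches the reduction mentioned above.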

    lbzip2 - Problem 2: dummy huffman tree

    lbzip2 uses a dummy Huffman tree.

    /* If there is only one prefix tree in current block, we need to create
       a second dummy tree.  This increases the cost of transmitting the block,
       but unfortunately bzip2 doesn't allow blocks with a single tree. */
      for (v = 0; v < MAX_ALPHA_SIZE; v++)
        s->length[t][v] = MAX_CODE_LENGTH;
    

    But all length values in these tables are equal to 20 (MAX_CODE_LENGTH).
    These values don't cover the whole range of Huffman bit codes.
    7-Zip checks this when building the Huffman tree, and reports a data error.
    The lbzip2 encoder could probably be changed to write some "good" lengths that form a full-range tree.
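
Whether a set of code lengths covers the whole code range can be tested with the Kraft sum: the sum of 2^(-length) over all symbols must equal exactly 1. The sketch below computes that sum in units of 2^(-20) and also shows one possible way to produce full-range lengths for a dummy table. All function names are hypothetical, and the length assignment is just one valid choice, not what lbzip2 or 7-Zip implement.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CODE_LENGTH 20

/* Kraft sum in units of 2^-MAX_CODE_LENGTH. */
static uint64_t kraft_sum(const unsigned char *len, unsigned n)
{
    uint64_t sum = 0;
    unsigned i;
    for (i = 0; i < n; i++) {
        assert(len[i] >= 1 && len[i] <= MAX_CODE_LENGTH);
        sum += (uint64_t)1 << (MAX_CODE_LENGTH - len[i]);
    }
    return sum;
}

/* A code covers the full range iff its Kraft sum is exactly 1. */
static int is_complete(const unsigned char *len, unsigned n)
{
    return kraft_sum(len, n) == ((uint64_t)1 << MAX_CODE_LENGTH);
}

/* One possible full-range assignment: with k = ceil(log2(n)),
   give (2^k - n) symbols length k-1 and the remaining ones length k. */
static void fill_complete_lengths(unsigned char *len, unsigned n)
{
    unsigned k = 1, shortCount, i;
    assert(n >= 3);
    while ((1u << k) < n)
        k++;
    shortCount = (1u << k) - n;
    for (i = 0; i < n; i++)
        len[i] = (unsigned char)(i < shortCount ? k - 1 : k);
}
```

For a 258-symbol table filled with length 20, the sum is only 258 units out of 2^20, so the check fails. With the assignment above, 254 symbols get length 8 and 4 symbols get length 9, and the sum is exactly 1.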

    7-Zip/lbzip2

    I can fix both problems in the 7-Zip decoder, so 7-Zip will be able to unpack such bz2 archives. But I'm not sure that I want to do it.
    I suppose that the lbzip2 encoder must be fixed as well, at least for problem 1.

     
  • William Nichols

    William Nichols - 2017-10-10

    Thanks for the quick response; I will use pbzip2 or pigz instead.

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-10

    p7zip can also do multi-threaded bzip2 compression.
    p7zip can also do multi-threaded xz compression.

     
    • aONe

      aONe - 2018-03-20

      Not related to the bug, but to your comment:

      To make a tarball (tar.bz2), you can use gnutar/bsdtar with pbzip2 or lbzip2 to create it directly with multi-threaded compression.

      As far as I can tell, p7zip does not make tarballs. You can also use gnutar with p7zip through an intermediate script that passes the -d/-c arguments, but this is not supported in bsdtar. So if you need to use bsdtar and want to create a tarball with multi-threaded compression, the way to go is pbzip2 or lbzip2. I suppose adding support for the -d/-c parameters to p7zip wouldn't be hard, anyway.

       
  • Ruarí Ødegaard

    So if you need to use bsdtar and want to create a tarball with multithread compression, the way to go is pbzip2 or lbzip2.

    Umm, you don’t have to do that. You can just use a pipe, e.g.

    bsdtar cf - myfiles | 7z -si -tbzip2 a myarchive.tar.bz2
    

    Or in reverse

    7z -so e myarchive.tar.bz2 | bsdtar xf -
    
     

    Last edit: Ruarí Ødegaard 2018-04-15
