wimlib / Discussion / wimlib discussion: Compression

Jeff - 2013-01-09

I am the author or ms-compress (https://github.com/coderforlife/ms-compress) which is an open-source cross-platform library for compressing in the various Microsoft compression formats. This includes LZNT1 (NTFS compression), Xpress (hibernation file and Windows Update), Xpress Huffman (WIM), and LZX (WIM and CAB).

When I started the project I looked for working implementations of these, but did not find yours - glad I found it now.

I have implemented working version of all of them besides LZX (LZX compressed data can't be uncompressed by Microsoft tools but can be decompressed by my own tools which decompress MS compressed data just fine).

Additionally, my Xpress Huffman (what you call Xpress) is my slowest one (about 4x slower than Windows' built-in implementation).

So my questions are:

On your main page you mention that the LZX documentation has inaccuracies. I agree, but would like to know if maybe I missed some (why my LZX implementation is not working).

How is your dictionary set up for the LZ copmression used in LZX and Xpress Huffman? I am sure that is my time-limiting part.

Thanks!
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

synchronicity - 2013-01-09

Great questions. I didn't know your project existed; based on the git repo it
looks like you started not long after I started this project!

Yes, as you probably noticed, both the XPRESS (Huffman) and LZX (WIM-style)
decompressors and compressors are functional in wimlib and are compatible with
Microsoft's software, at least to the extent that I've tested it. You probably
want to take a look at my code, especially lzx-decompress.c and lzx-compress.c,
to get some ideas of what might be going wrong in your implementation. The
comments near the top of lzx-decompress.c document some of the differences
between LZX as used in WIM files and LZX as used in CAB files.

The LZ77 match finding code is in lz77.c and is shared between the XPRESS and
LZX compressors. It is reasonably fast because it is based on the zlib code,
which uses a hash table containing 3 character strings in the window. You're
right that this part is critical for compression performance.

It would be great to have these compression algorithms available in an external
library, but currently my code is internal to wimlib and optimized for the WIM
format (e.g. chunks are always 32768 bytes or less, so more stuff can be
allocated on the stack, and only a subset of the LZX position slots are
possible). But, it should be very useful to you to look at how I've implemented
things (provided that you keep your code GPL'ed!).

Something to keep in mind: wimlib's XPRESS compressor seems to be better than
Microsoft's, but wimlib's LZX compressor still isn't as good as Microsoft's,
which is pretty much the only thing that still annoys me about the code! I
originally thought this was due to not implementing LZX block splitting properly
(i.e. splitting the data into aligned, verbatim, and uncompressed blocks), but I
implemented this in an experimental branch and it only improved the compression
ratio a tiny bit. I think what's missing from wimlib's implementation is
further optimization of the LZ77 matching, including using length 2 matches
when appropriate. So, that's something you could think about if you're going to
be implementing better match finding for LZX.

Good luck!

Last edit: synchronicity 2013-01-09

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff - 2013-01-14

Thanks for your ideas. I will definitely look into your LZ77. I have a similar list of CAB and WIM LZX differences in my lzx.h (https://github.com/coderforlife/ms-compress/blob/master/lzx.h).

My code is and will continue being GPL'ed. I have been trying to keep *nix compatibility although I haven't tested it at on on anything besides Windows.

My code does separate out LZX-CAB and LZX-WIM for some performance reasons, but the core of it is the same (supporting multiple dictionary sizes, delta-compression or trees, etc).

After I get them all working I plan to work on the dictionaries a lot to hopefully get up to speed. My LZNT1 was my first one and I made it really good - about 30x faster than MS's implementation and marginally better compression ratio. All my others are slower than MS but slightly better compression ratio.

Once I get them up to speed I plan to make parallel versions. In particular I know of how I can make the following parallel fairly easily:

LZNT1

Xpress Huffman (for WIM and non-WIM, they use different block sizes and some other slight differences)

LZX (for WIM only, CAB doesn't have the same block system)

Thanks!

BTW my immediate problem was due to creating trees that had only 1 element in them causing the Microsoft decompressors to fail. I guess I will have to add in dummy values sometimes. For example, if there is only ever one length encoded in the length tree it fails. Or if a tree has only 0s the the pre-tree would contain only symbol #18.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

synchronicity - 2013-01-14

Okay, sounds great. Yeah, the special case of making Huffman codes from alphabets with only one used symbol is handled in lines 274-292 of compress.c in wimlib. Note that all the Huffman codes created in both the XPRESS-Huffman and LZX-WIM compressors are created by the function containing these lines.

The dictionary for LZX-WIM is always the strings drawn from the preceding data of the WIM chunk, which is always 32768 bytes or less based on what I've observed. Although there is a field in the WIM header for an alternate chunk size, based on a test I did non-32768 chunk sizes are not implemented correctly by Microsoft's software. But it might be a good idea to make your version support alternate dictionary sizes anyway, just in case they're needed.

I actually already have XPRESS-Huffman and LZX-WIM compression parellelized in wimlib. Basically, the write_stream_list_parallel() function (write.c:1187) will write a list of files to an output FILE* in compressed form, and to do this an arbitrary number of threads are created to perform parallel compression of data chunks. Ultimately this just comes down to a lot of complicated setup so that multiple instances of xpress_compress() or lzx_compress() can run concurrently on different chunks of data (up to 32768 bytes each). I'm not sure what you had in mind for parallelization, but it might be tricky to implement in a useful way in a general purpose library.

By the way, how have you testing compatibility of your LZX-WIM compressor and decompressor with Microsoft's implementation? Where is their implementation even located--- wimgapi.dll? If so, their compression functions don't appear to be exported symbols.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff - 2013-01-14

I haven't been testing LZX-WIM yet, but I am testing all the others. For LZNT1, Xpress, and Xpress-Huffman with RtlCompressBuffer and RtlDecompressBuffer. I am unsure if WIM uses this Xpress-Huffman implementation. For LZX-CAB I use FCI and FDI which are the static library version of CABINET.DLL. They don't expose the compression function directly either. How I do this is by creating a dummy CAB file around the compressed data. I imagine doing something like that for WIM would be possible as well. It will be a bit trickier (I need to make sure the SHA1 is correct, etc)...

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

synchronicity - 2013-01-14

Well, I see two ways to go about testing your implementation of LZX-WIM (and XPRESS-Huffman, if it's different in WIM from elsewhere). The first would be to rely on the code in place for wimlib to make test WIMs and apply them with Microsoft's imagex.exe or Dism.exe, but plug in your own version of lzx_compress() with the same semantics as mine (lzx-compress.c:654; semantics are documented in comment). This might be the best way to avoid duplicated effort, but you'd need to be careful to implement your function with the same signature and semantics to avoid other changes to wimlib. Of course, if you do a good enough job with your code, ultimately it might be nice to get rid of my compression code entirely and just use yours via the external ms-compress library. The other solution I see would be to write your own code that can create a very simple WIM (in particular, a WIM containing one image with a single, arbitrary regular file in its root directory), and test with imagex.exe and Dism.exe this way. This would allow you to avoid messing with my code, although you'd need to understand the basics of the WIM format (header, metadata resource, XML data). Any other ideas?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Compression

Library to create, extract, and modify Windows Imaging (WIM) files

Forums

Help

Compression

Compression

Library to create, extract, and modify Windows Imaging (WIM) files

Forums

Help

Compression document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

Compression