Last weekend I implemented an optimization for image capture in wimlib that leaves streams unhashed after the call to wimlib_add_image(), then hashes them on demand in wimlib_write(). This should provide most of the caching advantage of ImagePyX's method while keeping the API intact: it remains possible to call wimlib_add_image() multiple times to add multiple images before calling wimlib_write(), or to mix in calls to other APIs such as wimlib_delete_image(). wimlib-imagex does not mix the APIs in this way, but I think it is a good idea to keep the API general. I have also implemented the optimization I mentioned earlier, whereby streams with a unique size need not be hashed more than once.
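Roughly, the deferred-hashing idea can be sketched as follows. This is a toy Python illustration, not wimlib's actual C implementation: the `Stream`, `capture()`, and `write()` names are made up, and real streams would be read lazily from disk rather than held in memory. A stream whose size is unique among all captured streams cannot be a duplicate of anything, so it can skip the hash-based duplicate check entirely.

```python
import hashlib
from collections import Counter

class Stream:
    """Toy stand-in for a captured file stream."""
    def __init__(self, data):
        self.data = data          # real code would store a path, not bytes
        self.size = len(data)
        self.hash = None          # left unhashed at "capture" time

def capture(datas):
    """Record streams without hashing (analogue of wimlib_add_image())."""
    return [Stream(d) for d in datas]

def write(streams):
    """Hash on demand (analogue of wimlib_write()), deduplicating by hash.
    Streams with a unique size skip the duplicate check."""
    size_counts = Counter(s.size for s in streams)
    seen = {}
    unique = []
    for s in streams:
        if size_counts[s.size] == 1:
            unique.append(s)      # unique size => cannot be a duplicate
            continue
        s.hash = hashlib.sha1(s.data).hexdigest()  # WIM uses SHA-1
        if s.hash not in seen:
            seen[s.hash] = s
            unique.append(s)
    return unique
```

For example, capturing four streams where two are identical yields three unique streams, and the stream with a one-of-a-kind size never touches the hasher during the duplicate check.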
I also made it so that hard-linked files are detected and immediately linked to the same in-memory inode, without any need to read the file contents again. (I don't recall if ImagePyX implemented an equivalent optimization or not.)
I still need to do some work to get multithreaded compression working again, since I rewrote much of the code, and I will run some benchmarks once I have it working.
The idea of excluding certain files from compression based on filename is potentially a good one, and it is in fact already implemented in Microsoft's imagex.exe via the [CompressionExclusionList] configuration section. One thing to keep in mind is that this is only a heuristic: a file's name guarantees nothing about its contents, and in general a file may be accessible through multiple different names anyway.
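A name-based exclusion check is simple to sketch. The patterns below are examples of the kind of entries one might put in such a list, not Microsoft's actual defaults, and `is_compression_excluded` is an invented name:

```python
import fnmatch
import os

# Example patterns in the spirit of [CompressionExclusionList]; these are
# illustrative choices (already-compressed formats), not real defaults.
EXCLUSION_PATTERNS = ["*.mp3", "*.zip", "*.cab"]

def is_compression_excluded(path, patterns=EXCLUSION_PATTERNS):
    """Return True if the file should be stored uncompressed, judged purely
    by name -- a heuristic, since a name guarantees nothing about contents."""
    name = os.path.basename(path).lower()
    return any(fnmatch.fnmatch(name, pat) for pat in patterns)
```

The case-insensitive match reflects Windows filename semantics; the check remains a guess, as noted above, since nothing stops someone naming an uncompressed file `data.cab`.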
Aborting compression partway through a file if it isn't compressing well may be another good heuristic, but again there is no guarantee this will work well in the general case. I would personally lean toward compressing the whole file, as is done currently, so that we can guarantee the output WIM is as small as possible.
OK, I'll try to implement my changes in ImagePyX and see. The idea of the threshold came from observing that a typical Windows folder contains large (possibly already-LZX-compressed) CAB files, such as driver or setup repositories. If you want compression at any cost, you'll appreciate the few bytes earned by shrinking the initial CAB directory; if you want maximum speed, it is probably better to predict the result and avoid processing (say, with the slow LZX compressor) the remaining 50-99% of the contents.