Other tricks to improve capture speed?

maxpat78
Created: 2013-04-02, last updated: 2013-04-03
  • maxpat78

    maxpat78 - 2013-04-02
    • your idea: directly compress a stream if the SHA-1 of its first chunk does not belong to a special hash table, so that the stream is not read twice?
    • directly copy a stream smaller than [n] bytes?
    • copy files belonging to a typically incompressible or poorly compressible type (jpg, cab, zip...)?
    • provide a user-tunable threshold to abort compression and save CPU time: if, after [nn]% of the input has been processed, the compression ratio is zero, abort compression and simply copy the stream; if the ratio is less than [nn]%, begin emitting uncompressed chunks only? (A sketch of the first three heuristics follows below.)
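
    A minimal Python sketch of the first three heuristics (the constants, names, and extension set are illustrative assumptions, not ImagePyX code):

        import hashlib
        import os

        MIN_COMPRESS_SIZE = 4096                    # the "[n] bytes" threshold
        STORED_EXTENSIONS = {'.jpg', '.cab', '.zip'}
        first_chunk_hashes = set()                  # SHA-1s of first chunks seen so far

        def plan_stream(path, size, first_chunk):
            """Classify a stream before reading it in full:
            'dedup'    -- first chunk matches a known stream, full hash needed;
            'copy'     -- store uncompressed (small, or incompressible type);
            'compress' -- compress directly, with no second read required."""
            digest = hashlib.sha1(first_chunk).digest()
            if digest in first_chunk_hashes:
                return 'dedup'      # possible duplicate: hash the whole stream
            first_chunk_hashes.add(digest)
            if size < MIN_COMPRESS_SIZE:
                return 'copy'
            if os.path.splitext(path)[1].lower() in STORED_EXTENSIONS:
                return 'copy'
            return 'compress'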
     
    Last edit: maxpat78 2013-04-02
  • synchronicity

    synchronicity - 2013-04-02

    Hi,

    Last weekend I implemented an optimization for image capture in wimlib that basically leaves streams unhashed after the call to wimlib_add_image(), then hashes them on demand in wimlib_write(). This should give most of the caching advantage of ImagePyX's method while keeping the API intact, so it is still possible to do things like call wimlib_add_image() multiple times to add multiple images before calling wimlib_write(), or mix in calls to other APIs like wimlib_delete_image(). wimlib-imagex does not mix the APIs in this way, but I think it is a good idea to keep the API more general. I have also implemented the optimization I mentioned earlier, whereby a stream with a unique size need not be hashed more than once.
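
    To illustrate the unique-size optimization: a stream whose size is shared by no other stream cannot be a duplicate, so it never needs a separate dedup hash pass. A sketch in Python, with streams simplified to (path, size) pairs (this is not wimlib's actual C code):

        from collections import Counter

        def split_by_size_uniqueness(streams):
            """Partition streams into those that may have duplicates
            (their size is shared, so a dedup hash is required) and
            those whose size is unique (hash at most once, on demand)."""
            size_counts = Counter(size for _path, size in streams)
            may_dup = [s for s in streams if size_counts[s[1]] > 1]
            unique  = [s for s in streams if size_counts[s[1]] == 1]
            return may_dup, unique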

    I also made it so that hard-linked files are detected and immediately linked to the same in-memory inode, without any need to read the file contents again. (I don't recall if ImagePyX implemented an equivalent optimization or not.)
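
    On POSIX systems the usual way to detect hard links is to key files by (device, inode); here is a self-contained Python sketch of that idea (not wimlib's implementation):

        import os

        def find_hard_link_groups(root):
            """Map each (st_dev, st_ino) pair to the paths sharing it.
            A group with more than one path is a set of hard links
            whose contents only need to be read and stored once."""
            inodes = {}
            for dirpath, _dirs, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    st = os.lstat(path)
                    inodes.setdefault((st.st_dev, st.st_ino), []).append(path)
            return inodes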

    I still need to do some work to get multithreaded compression working again, since I rewrote that part of the code, and I will run some benchmarks once it is working.

    Excluding certain files based on filename is potentially a good idea, and it is in fact already implemented in Microsoft's imagex.exe via the [CompressionExclusionList] configuration. One thing to keep in mind is that this is only a heuristic: a certain name guarantees nothing about a file's contents, and in general a file may be accessible through multiple different names anyway.
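
    For reference, the configuration file accepted by imagex.exe lists glob patterns under that section, one per line; an illustrative example (the specific patterns here are my own, not Microsoft's defaults):

        [CompressionExclusionList]
        *.mp3
        *.zip
        *.cab
        \WINDOWS\inf\*.pnf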

    Aborting compression partway through a file if it isn't compressing well may be another good heuristic, but again there is no guarantee it will work well in the general case. I personally would lean towards just trying to compress the whole file, as is done currently, so that we can guarantee the output WIM is as small as possible.

    Thanks!

     
  • maxpat78

    maxpat78 - 2013-04-03

    Ok, I'll try to implement my stuff in ImagePyX and see... The idea of the threshold came from observing that a typical Windows folder contains large (possibly already LZX-compressed) CAB files, such as driver or setup repositories. If you want compression at any cost, you'll appreciate the few bytes earned by shrinking the initial CAB directory; if you want maximum speed, it is probably better to predict the result and avoid processing (say, with the slow LZX codec) the remaining 50-99% of the contents.
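
    A rough Python sketch of such an early-abort check, using zlib as a stand-in for WIM's much slower LZX codec (the probe fraction and gain threshold are illustrative stand-ins for the [nn]% knobs):

        import zlib

        def compress_or_copy(data, probe_frac=0.10, min_gain=0.02):
            """Compress 'data' unless compressing a probe of its first
            probe_frac predicts a poor ratio.  Returns (payload,
            compressed_flag); on a failed probe the caller should
            store the raw bytes instead."""
            probe_len = max(4096, int(len(data) * probe_frac))
            probe = zlib.compress(data[:probe_len])
            if len(probe) >= probe_len * (1 - min_gain):
                return data, False      # predicted incompressible: copy raw
            return zlib.compress(data), True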

     
