#794 compression improvement: better duplicate file handling

Haudy Kazemi

Compression improvement: better duplicate file handling

This is a feature request that can significantly improve the compression ratio of archives that have duplicate files in them, particularly if the duplicated files are larger than the dictionary size. The name and location of these files does not matter (i.e. they may be in the same folder under different names, under different folders, etc.).

Use cases:
1.) improving compression ratios of solid and non-solid archives

2.) backups of filesystems with hardlinks in them

3.) drive backups of the whole filesystem, including programs: there are many gigabytes of duplicated DLLs in a complete file-based backup of a Windows drive. Many of these duplicates come from programs keeping a copy of DLLs in their own folder.

4.) Driverpacks (http://driverpacks.net/DriverPacks/ ) also have many duplicated DLLs in them due to how they must be organized into multiple folders to separate drivers

One possible algorithm:
1.) During the standard 7-Zip compression process a CRC is generated for each file. Record this CRC and the file size together in a table.

2.) as part of adding each new file to the archive, compare the CRC and file size of the new file to the table of all other files already in the archive.

3.) If there isn't a CRC and file size match, continue processing the file as 7-Zip normally would. (Traditional processing technique)

4.) If there IS a CRC and file size match, begin a byte-for-byte comparison of the two files to verify 100% that they are identical. (This byte-for-byte comparison *might* be replaced with a very strong hashing function, but that's still not as safe as a true byte-for-byte comparison. The default CRC32 is strong enough to quickly weed out non-identical files, but not strong enough to ensure they are truly identical. After all, users expect lossless compression to be guaranteed with 7-Zip.)

5.) If the byte-for-byte comparison succeeds (the files are truly identical), add an entry in the 7-Zip file table that upon decompression directs 7-Zip to use the same compressed source bytes of data as for the identical file. On Unix-like systems this would be like a creating a hardlink to a file (a pointer to the same data + a reference counter).

6.) If the byte-for-byte comparison failed, continue processing the file as 7-Zip normally would. (and possibly note somewhere in the archive that even though the CRC matches, the files were not identical).


Note: an enhancement of this algorithm would be to check the filesystem for hardlink between the two files in question, and use that as an alternative to the byte-for-byte comparison. (If they're hardlinked to the same place, there is no reason to do the byte-for-byte comparison.) Fully implemented, this might use recreate hardlinks upon restoration. (NTFS does support hardlinks, just not via the GUI).


  • wiiiiiii

    I guess it was already suggested but there is a easier way to detect duplicate files.

    User intervention. Give them a list of suspected duplicates and have them agree that they are indeed duplicates and that they only want one single file in the archive pack but all instances recreated in there start state on unpacking.

  • Guido Leenders
    Guido Leenders

    I would prefer automatic detection, so that 7zip requires no thinking of the user.

  • rbasso1979

    I think file comparison would be a great improvement to 7-Zip. Compression rate will certainly increase.

    I had the problem of compressing big files with a great probability of duplication, and I am not the only one.

    The only user intervention could be the option: "Check for duplicated files", maybe default could be NO for low level of compression and YES for high level compression, to avoid performance loss processing similar files.

  • Nemo

    I would be interested in this too. It seems lrzip manages quite well:

    Tarball of 6 consecutive kernel trees.

    Compression Size Percentage Compress Decompress
    None 2373713920 100
    7z 344088002 14.5 17m26s 1m22s
    lrzip 104874109 4.4 11m37s 56s
    lrzip -l 223130711 9.4 05m21s 1m01s

  • Mark

    I would really like to have this feature too.