Compression improvement: better duplicate file handling
This feature request could significantly improve the compression ratio of archives that contain duplicate files, particularly when the duplicated files are larger than the dictionary size. The names and locations of these files do not matter (i.e. they may be in the same folder under different names, in different folders, etc.). Use cases include:
1.) improving compression ratios of solid and non-solid archives
2.) backups of filesystems with hardlinks in them
3.) drive backups of the whole filesystem, including programs: there are many gigabytes of duplicated DLLs in a complete file-based backup of a Windows drive. Many of these duplicates come from programs keeping a copy of DLLs in their own folder.
4.) Driverpacks (http://driverpacks.net/DriverPacks/) also contain many duplicated DLLs, because the drivers must be organized into separate folders
One possible algorithm:
1.) During the standard 7-Zip compression process a CRC is generated for each file. Record this CRC and the file size together in a table.
2.) As part of adding each new file to the archive, compare the CRC and file size of the new file against the table of all files already in the archive.
3.) If there isn't a CRC and file size match, continue processing the file as 7-Zip normally would. (Traditional processing technique)
4.) If there IS a CRC and file size match, begin a byte-for-byte comparison of the two files to verify 100% that they are identical. (This byte-for-byte comparison *might* be replaced with a very strong hashing function, but that's still not as safe as a true byte-for-byte comparison. The default CRC32 is strong enough to quickly weed out non-identical files, but not strong enough to ensure they are truly identical. After all, users expect lossless compression to be guaranteed with 7-Zip.)
5.) If the byte-for-byte comparison succeeds (the files are truly identical), add an entry to the 7-Zip file table that, upon decompression, directs 7-Zip to use the same compressed source bytes as the identical file. On Unix-like systems this would be like creating a hardlink to a file (a pointer to the same data plus a reference counter).
6.) If the byte-for-byte comparison fails, continue processing the file as 7-Zip normally would (and possibly note somewhere in the archive that even though the CRCs match, the files are not identical).
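The detection half of the steps above can be sketched in Python. This is only an illustration of the proposed logic, not actual 7-Zip code: files are bucketed by (size, CRC32), and candidates are then verified byte-for-byte before being recorded as duplicates. The function name `find_duplicates` is hypothetical.

```python
import zlib
import filecmp
from pathlib import Path

def crc32_of(path, chunk=1 << 16):
    """CRC32 of a file, streamed in chunks (step 1)."""
    crc = 0
    with open(path, "rb") as f:
        while block := f.read(chunk):
            crc = zlib.crc32(block, crc)
    return crc

def find_duplicates(paths):
    """Map each file to an earlier identical file, if one exists.

    Returns {duplicate_path: original_path}. A (size, CRC32) match
    (step 2) is only a candidate; it is confirmed with a full
    byte-for-byte comparison (step 4) before being treated as a
    duplicate (step 5).
    """
    seen = {}        # (size, crc) -> first path with that signature
    duplicates = {}  # later path -> earlier identical path
    for p in map(Path, paths):
        key = (p.stat().st_size, crc32_of(p))
        if key in seen and filecmp.cmp(p, seen[key], shallow=False):
            duplicates[p] = seen[key]   # steps 4-5: verified duplicate
        else:
            seen.setdefault(key, p)     # steps 3/6: process normally
    return duplicates
```

Note that a CRC collision between non-identical files falls through to normal processing (step 6), so correctness never depends on CRC32's strength.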
Note: an enhancement of this algorithm would be to check the filesystem for a hardlink between the two files in question and use that as an alternative to the byte-for-byte comparison. (If they are hardlinked to the same data, there is no reason to do the byte-for-byte comparison.) Fully implemented, this might also recreate the hardlinks upon restoration. (NTFS does support hardlinks, just not via the GUI.)
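A minimal sketch of that hardlink shortcut, again as illustrative Python rather than 7-Zip code: if two paths resolve to the same device and inode, they are the same underlying file, so the byte-for-byte comparison can be skipped entirely.

```python
import os
import filecmp

def files_identical(a, b):
    """Verify that two same-size, same-CRC files are identical.

    If the filesystem already hardlinks them to the same data
    (same device and inode), skip the byte-for-byte comparison;
    otherwise fall back to a full content compare.
    """
    sa, sb = os.stat(a), os.stat(b)
    if (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino):
        return True  # same underlying file: trivially identical
    return filecmp.cmp(a, b, shallow=False)
```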