Hello Igor!
First of all, I'm a big fan of 7zip. It's unbeatable at what it does; none of the commercial archivers come even close.
Recently, I upgraded from 9.38 to the 15.xx branch of 7-zip, because why not. But I found out that there is a compression regression. I don't know to what extent my scenario is an extreme/specific one, but this issue renders 15.xx far worse than 9.38 for me.
Scenario:
Imagine a backup of your data - not incremental, but rather, say, a monthly backup of your folder. Various types of data - pictures (jpeg, png, etc.), exe files, documents, videos, even some archives. You change the contents of this folder from time to time, but a lot of these files remain as they are - not changing, or maybe just moving from folder A to folder B, but still staying within the "root" folder of this backup.
Imagine your 'base' folder is ~50 GB - so if you have 10 months of backups, you will have 500 GB of backup.
In 9.38, when I compress (LZMA2, dictionary 1024M, word size 256, 3 threads, solid block), the compressor works like this:
(assume file A is always the same file):
2015_01_01/folder/a.jpg
2015_02_01/folder/a.jpg
2015_03_01/folder/a.jpg
2015_01_01/folder/b.jpg
2015_02_01/folder/b.jpg
2015_03_01/folder/b.jpg
..... you get the idea. It compresses the first file and then simply goes "yes, this is the same file here", so the compressed size for all X copies of the same file is essentially that of a single file.
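For reference, those settings should map to a command line roughly like the following sketch (the archive and folder names are placeholders, and I'm matching the GUI fields to -m switches from memory):

```python
import subprocess

# Sketch of the command-line equivalent of the GUI settings above.
# "backup.7z" and "backup_root/" are placeholder names.
#   -m0=lzma2  compression method    -md=1024m  dictionary size
#   -mfb=256   word size             -mmt=3     number of threads
#   -ms=on     solid archive
subprocess.run(
    ["7z", "a", "-t7z",
     "-m0=lzma2", "-md=1024m", "-mfb=256", "-mmt=3", "-ms=on",
     "backup.7z", "backup_root/"],
    check=True,
)
```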
**The 15.xx version compressor goes like this:**
2015_01_01/folder/a.jpg
2015_01_01/folder/b.jpg
2015_01_01/folder/c.jpg
2015_01_01/folder/d.jpg
2015_02_01/folder/a.jpg
2015_02_01/folder/b.jpg
2015_02_01/folder/c.jpg
2015_02_01/folder/d.jpg
Which I assume is okay if the dictionary can cover the entire size of the archive. But imagine that the folders are each ~50 GB - I *think* that by the time the compressor gets to the next month's folder, it no longer has the required 'word' in the dictionary to pack file a.jpg by simply referencing the copy it already compressed earlier.
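To make that intuition concrete, here is a toy sketch of the sliding-window idea (an illustration only, not how LZMA2 is actually implemented):

```python
# Toy model of an LZ sliding window ("dictionary"): a compressor can only
# emit a back-reference to bytes that are still inside the window behind
# the current position in the stream.
def can_backreference(offset_back_gb: float, dictionary_gb: float) -> bool:
    """True if an identical earlier copy is still inside the window."""
    return offset_back_gb <= dictionary_gb

# 9.38 "type" order: the previous copy of a.jpg sits right next to it.
print(can_backreference(offset_back_gb=0.01, dictionary_gb=1.0))  # True

# 15.xx path order: the previous copy is a whole monthly folder (~50 GB)
# behind in the stream, far outside a 1 GB dictionary.
print(can_backreference(offset_back_gb=50.0, dictionary_gb=1.0))  # False
```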
Now, I don't know how this algorithm works - whether it compares hashes, or whether it just looks for identical filenames and sorts the files by name.
It feels like the 15.xx compression suffers because the compressor is trying to prioritize file order (in the archive) over compression.
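If it is just a filename sort, the difference between the two orderings might look something like this (my guess at the grouping criteria, not 7-Zip's actual comparator):

```python
import os

paths = [
    "2015_01_01/folder/a.jpg", "2015_01_01/folder/b.jpg",
    "2015_02_01/folder/a.jpg", "2015_02_01/folder/b.jpg",
    "2015_03_01/folder/a.jpg", "2015_03_01/folder/b.jpg",
]

# 15.xx default: plain path order, matching the on-disk layout.
path_order = sorted(paths)

# Old "type"-style order (guessed criteria): group by extension and base
# name, so identical files from different months become neighbors.
type_order = sorted(paths, key=lambda p: (os.path.splitext(p)[1],
                                          os.path.basename(p), p))

print(type_order[:3])
# ['2015_01_01/folder/a.jpg', '2015_02_01/folder/a.jpg', '2015_03_01/folder/a.jpg']
```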
For illustration: 9.38 was able to pack my 800 GB archive into ~94 GB. With 15.xx, the job was not even 40% done, yet the compressed output for the same files was already at 150 GB.
Last edit: Little Vulpix 2015-11-18
There are some reasons why the order was changed:
1) It's better for some types of data (for example, if you compress a source code tree).
2) It's better for HDDs (it reduces seek time) when the order of files in the archive matches the order of files on the disk.
You can use the old "type" order, if you write
qs
in the Parameters field.
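On the command line this should correspond to the qs parameter of the -m switch; a minimal sketch, with the same placeholder names as above:

```python
import subprocess

# Same settings as before, plus qs to restore the old sort-by-type order.
subprocess.run(
    ["7z", "a", "-t7z",
     "-m0=lzma2", "-md=1024m", "-mfb=256", "-mmt=3", "-ms=on",
     "-mqs",  # sort files by type within the solid archive, as 9.38 did
     "backup.7z", "backup_root/"],
    check=True,
)
```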
Aha! Thank you! This works for me. The -mqs / qs is a new parameter, right? I've never seen it before. But it does exactly what I needed. You can close this.