I wondered whether a smarter approach could yield better compression ratios.
I experimented with a set of files totalling 2,147,483,648 bytes, which 7-Zip compressed to 1,620,278,810 bytes (a 75% compression ratio). With a cleverer method, I reduced that to 971,212,471 bytes (a 45% ratio). What may seem even more surprising is that the computing resources needed for the whole compression step (memory and CPU load) also dropped drastically. Quick testing showed, moreover, that my method was far from optimal and had plenty of room for improvement.
So what's the deal? I have not reinvented relativity; I just added one step between the raw files and the compression. The set consists of 10 relatively large but similar files; you could think of them as clones. 7-Zip performs quite well on clone sets unless their size is too large. So I used a diffing program, xdelta (open source, xdelta.org), and after choosing a parent and creating diffs for the nine other files, I compressed the parent along with the diffs and the program needed to recreate the clones. That is how I ended up with my result.
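The parent-plus-diffs idea can be sketched in Python. This is not the xdelta pipeline from the post; it is a hypothetical stand-in that uses zlib's preset-dictionary feature (`zdict`) to encode each clone as back-references into the parent, illustrating the same delta-then-compress principle (the file contents and sizes below are made up):

```python
import random
import zlib

# Hypothetical stand-in for the xdelta step described above: instead of
# creating standalone diffs, compress each "clone" with the parent
# installed as a zlib preset dictionary, so shared bytes become cheap
# back-references. Data is synthetic; sizes are illustrative only.
random.seed(0)
parent = bytes(random.randrange(256) for _ in range(20000))  # the chosen parent
clones = [parent[:5000] + b"edit #%d" % i + parent[5000:] for i in range(9)]

def delta_compress(data, dictionary):
    # Note: the dictionary must fit zlib's 32 KiB window for full effect.
    co = zlib.compressobj(level=9, zdict=dictionary)
    return co.compress(data) + co.flush()

def delta_decompress(blob, dictionary):
    do = zlib.decompressobj(zdict=dictionary)
    return do.decompress(blob) + do.flush()

# Baseline: every clone compressed on its own.
independent_total = sum(len(zlib.compress(c, 9)) for c in clones)

# Delta-style: each clone compressed against the parent.
deltas = [delta_compress(c, parent) for c in clones]
delta_total = sum(len(d) for d in deltas)

# The extra step is lossless: every clone is recreated exactly.
roundtrip_ok = all(delta_decompress(d, parent) == c
                   for d, c in zip(deltas, clones))
```

Unlike this toy, real delta tools such as xdelta are not limited to a 32 KiB dictionary window, which is why they cope with the large files described in the post.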
Problem: this method, while spectacularly effective in this case, would not work in the general case.
Answer: which is why it could be implemented as a separate 'clone' method alongside LZMA and the like.
Other applications: the idea of adding an extra step has other uses; that step could, for example, be decryption/re-encryption. A lot of files are encrypted for the sole purpose of inflating their size and making them difficult to copy, and sometimes the encryption is reversible. In those cases, decrypting them before compression reduces the size drastically while keeping the whole process lossless, since they can be re-encrypted afterwards.
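A small sketch of that decrypt-before-compress step, using a toy reversible cipher (a seeded XOR keystream — an assumption for illustration, not any real scheme from the post):

```python
import random
import zlib

# Toy reversible "encryption" standing in for the kind described above:
# XOR with a seeded keystream, so applying it twice with the same seed
# restores the original. This is an illustration, not a real cipher.
def xor_stream(data, seed=42):
    rng = random.Random(seed)
    return bytes(b ^ rng.randrange(256) for b in data)

plaintext = b"highly repetitive payload " * 1000
ciphertext = xor_stream(plaintext)

# The keystream destroys the redundancy the compressor relies on...
size_cipher = len(zlib.compress(ciphertext, 9))
# ...so decrypting first compresses far better.
size_plain = len(zlib.compress(plaintext, 9))

# Decrypt -> compress -> decompress -> re-encrypt is fully lossless.
restored = xor_stream(zlib.decompress(zlib.compress(xor_stream(ciphertext), 9)))
```

The point is only that the round trip is lossless: the archive stores the decrypted (compressible) form, and extraction re-encrypts to recover the original bytes exactly.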
All of this may appear highly custom, but I believe it could be the next stage for archivers, at a time when improving the general-purpose compression ratio any further appears very difficult. Let me hear your thoughts, please.
Using xdelta for this purpose nowadays is really esoteric; I recommend exdupe -c0 instead. There is also other deduplication software (zpaq, srep). But I agree with you: built-in deduplication is one of the most requested features for modern archivers, so it would be great to get it in 7-Zip.
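The deduplication those tools (exdupe, zpaq, srep) build on can be sketched as content-defined chunking: cut each file at boundaries derived from the data itself, then store each distinct chunk once. The chunking parameters below (window, modulus, minimum size) are illustrative choices, not any tool's real values:

```python
import hashlib
import random

# Minimal sketch of content-defined-chunking deduplication. A rolling sum
# over the last `window` bytes decides where chunks end, so identical
# content produces identical chunks even when it shifts position.
def chunk(data, window=16, modulus=101, min_size=128):
    out, start, s = [], 0, 0
    for i, b in enumerate(data):
        s += b
        if i >= window:
            s -= data[i - window]           # keep the sum over the last `window` bytes
        if i - start + 1 >= min_size and s % modulus == 0:
            out.append(data[start:i + 1])   # cut at a content-defined boundary
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

def dedup_store(files):
    store = {}                              # sha256 hex digest -> unique chunk bytes
    for f in files:
        for c in chunk(f):
            store[hashlib.sha256(c).hexdigest()] = c
    return store

random.seed(1)
base = bytes(random.randrange(256) for _ in range(20000))
# Ten "clones": the same base with a small insertion at different offsets.
files = [base[:1000 * i] + b"patch" + base[1000 * i:] for i in range(10)]

total_bytes = sum(len(f) for f in files)
stored_bytes = sum(len(c) for c in dedup_store(files).values())
```

Because the boundaries depend only on local content, the chunks after each insertion realign with the base file's chunks, so the store holds roughly one copy of the base plus a few chunks per edit.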
Yet the author of 7-Zip doesn't seem to care?
Today I came across a piece of 7-Zip-based software that basically does what I described here, except that it does it automatically and even comes with an installer. The only problem for me is that it is not open source, so I do not want to use it. Yet the results are impressive; here they are for reference:
Yes, it is beating 7-Zip to a pulp, so what is taking so long to add this?