Wikimedia uses 7-Zip to compress its 10TB of full-change-history dumps monthly; 7-Zip's support for large dictionary sizes allows it to get a many-times-better compression ratio than bzip2, because only 7-Zip compresses away the long sections of text that each revision of a page shares with the previous rev. The CPU time it takes to make those .7zs is, of course, pretty substantial (days running on 27 cores, I think). I've been playing with running dumps through a sloppy-but-quick byte-oriented precompressor that only finds long matches within a 4MB history buffer. Though the program is fairly unsophisticated, the results were decent: on Wikipedia's dump XML, I can get average ratios similar to 7-Zip's while compressing at on the order of 100 MB/CPU-sec. Yay!
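For the curious, the core idea of that kind of byte-oriented long-match precompressor can be sketched in a few lines. (This is my own toy illustration, not histzip's actual algorithm or output format; the anchor size, minimum match length, and op encoding are all made up, and a real implementation would bound the history table to its window rather than keeping everything.)

```python
ANCHOR = 16      # bytes hashed per anchor (made-up value for illustration)
MIN_MATCH = 32   # only encode fairly long matches; short ones aren't worth it

def precompress(data: bytes):
    """Return a list of ops: (b'lit', bytes) or (b'cpy', distance, length)."""
    table = {}                  # recent anchor -> most recent position seen
    ops, i, lit_start = [], 0, 0
    while i + ANCHOR <= len(data):
        key = data[i:i+ANCHOR]
        j = table.get(key)      # exact match on the anchor bytes
        table[key] = i
        if j is not None:
            # extend the match forward as far as the bytes agree
            length = ANCHOR
            while i + length < len(data) and data[j+length] == data[i+length]:
                length += 1
            if length >= MIN_MATCH:
                if lit_start < i:
                    ops.append((b'lit', data[lit_start:i]))
                ops.append((b'cpy', i - j, length))
                i += length
                lit_start = i
                continue
        i += 1
    if lit_start < len(data):
        ops.append((b'lit', data[lit_start:]))
    return ops

def decompress(ops):
    out = bytearray()
    for op in ops:
        if op[0] == b'lit':
            out += op[1]
        else:
            _, dist, length = op
            start = len(out) - dist
            for k in range(length):   # byte-at-a-time so overlapping
                out.append(out[start + k])  # (self-referential) copies work
    return bytes(out)
```

On revision-history-style input (a page's text repeated with small edits), almost everything after the first revision comes out as a handful of long copy ops, which is exactly the redundancy a general-purpose compressor with a small window can't reach.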
7za -mx=3 also uses a 4MB history buffer and gets almost as good a compression ratio, but it's about 3x slower than a pipeline with histzip and another compressor. 30MB/s is still quite good in absolute terms, and a big improvement on the rate Wikipedia's getting now, but the gap between that and 100MB/s is nothing to sneeze at when looking at 10TB to pack.
A few folks on the Wikitech list have said that I should talk to you about all this, and that you've expressed interest in lrzip/rzip-like algorithms in past threads. If you were to implement some kind of long-range compressor, perhaps the right form would be a pre-compression filter like the one that exists for executables, and its goal might be improving compression speed on data like Wikipedia's as much as or more than improving the ratio. Then 7-Zip would work even better for those datasets, with no chance of regressions for other users.
I'm not sure it makes sense to make any type of change to 7-Zip for such a niche case, and of course a new filter would mean new .7z files that old versions of 7-Zip can't read. But since folks said I should talk to you, and in any case this might all be interesting data for you (even if it doesn't lead to any specific work), it seemed to make sense to drop a note.
(I should also say I'm really impressed at the 30 MB/s speed for 7za -mx=3, considering all of the work the compressor's still doing and that it's almost like it's being pressed into service as a diff engine when it's designed as a general-purpose compressor. It's a space shuttle that also turns into a bicycle!)
The (very simple) precompressor is https://github.com/twotwotwo/histzip . histzip|bzip2 is currently the only way I've found to get 7z-like ratios on wiki dumps at 100 MB/s, but replacing bzip2 with an LZMA compressor gets pretty close results.
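To make concrete why a long-range pre-pass in front of bzip2 helps at all, here's a small experiment (my own illustration: os.urandom stands in for far-apart revision text, and the "precompressed" form just simulates the short back-reference a long-range pass would emit in place of the second copy):

```python
import bz2, os

block = os.urandom(512 * 1024)        # one incompressible 512KB "revision"
filler = os.urandom(2 * 1024 * 1024)  # 2MB of unrelated data in between
data = block + filler + block         # same revision repeated ~2.5MB apart

# bzip2 compresses in independent blocks of at most 900KB, so it cannot
# see that the last 512KB already appeared 2.5MB earlier.
plain = len(bz2.compress(data))

# A long-range pre-pass would collapse the second copy to a short
# reference; this string just simulates that output for size purposes.
precompressed = block + filler + b"<copy: 2.5MB back, 512KB long>"
pre = len(bz2.compress(precompressed))
```

`plain` stays around the full 3MB (random data doesn't shrink), while `pre` drops by roughly the 512KB the pre-pass deduplicated, which is the whole trick behind the histzip|bzip2 pipeline.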
Here's a Wikimedian saying it would be great if a long-range-ish compression strategy were integrated (though he says LZMA, not 7-zip): http://lists.wikimedia.org/pipermail/wikitech-l/2014-January/074114.html
The history files themselves are the pages-meta-history .7z's at http://dumps.wikimedia.org/enwiki/latest/
I suggest you look into more modern tools than rzip: zpaq, exdupe, pcompress, and my own srep: http://freearc.org/research/SREP39.htm#3.91
Those are very cool (I like the FutureLZ idea--reducing memory needs for the decompressor). I think I got referred here partly because of 7-Zip's wide distribution and Wikimedia's current use of it. In any case, thanks for the link.