.tar.gz random access
Currently, when you want to extract a single file from a .tar.gz archive, the entire portion of the archive up to the start of that file has to be read twice:
This process can be optimized. If the decompression dictionary is cached when entering the archive as a folder, and an index with offsets of all packed files in the gzip stream is saved, then extraction can start directly from the required offset. This way, decompression won’t require processing the entire preceding part of the archive. This could significantly improve random access performance in .tar.gz archives.
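As a rough illustration of the offset-index half of this idea, here is a minimal sketch that uses only the Python standard library. The helper names (`make_sample_archive`, `build_manifest`, `read_member`) are hypothetical and chosen for this example. It records where each file's data starts in the decompressed tar stream and then reads a single member by seeking to that offset. Note the assumption: plain `gzip.GzipFile` still has to decompress everything before the target offset when seeking, so the speed-up described above only appears once the stream is replaced by a seekable gzip reader that keeps decompression checkpoints, such as the one provided by indexed_gzip.

```python
import gzip
import io
import tarfile


def make_sample_archive(files):
    """Build a small in-memory .tar.gz from a {name: bytes} mapping (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_manifest(stream):
    """Map each regular file name to (data offset, size) in the decompressed stream.

    `stream` is any seekable file object that yields the *uncompressed* tar bytes.
    TarInfo.offset_data is the position of the member's payload in that stream.
    """
    manifest = {}
    with tarfile.open(fileobj=stream, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                manifest[member.name] = (member.offset_data, member.size)
    return manifest


def read_member(stream, manifest, name):
    """Read one file's bytes by seeking straight to its recorded offset."""
    offset, size = manifest[name]
    stream.seek(offset)
    return stream.read(size)


# Demo: index the archive once, then fetch a single member by offset.
archive = make_sample_archive({
    "a.txt": b"alpha",
    "dir/b.bin": bytes(range(256)) * 40,
})
with gzip.GzipFile(fileobj=io.BytesIO(archive)) as gz:
    manifest = build_manifest(gz)
    data = read_member(gz, manifest, "dir/b.bin")
```

The manifest is a plain dictionary, so it could be saved next to a gzip checkpoint index and reused the next time the archive is opened. Sparse files and other special member types are not handled in this sketch.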
Here is a PoC in Python using the awesome indexed_gzip library:
In this simple example, building the gzip index and the tar manifest are two separate "passes" through the source archive. In a full implementation, both can be done in one pass.
Don't forget to do
or
depending on your system
Usage:
A sample large_archive.tar.gz is also attached for testing:
large_archive.tar.gz (github.com)
I found a tool that implements exactly this logic:
https://github.com/mxmlnkn/ratarmount