intzip Wiki

Fast compression of integer sets.

Brought to you by: boethin

Home

IntZip is a lossless compression algorithm for strictly increasing lists of unsigned integers.

Compression means that a suitable list of integers can be encoded to a (whenever possible) shorter list of the same integer format. An arbitrary set of integers (i.e., a collection of unique numbers where the order does not matter) may be transformed into suitable input data by way of translation and sorting.

The main focus of the algorithm is to encode structural characteristics of a given list, such as coherent intervals and repeated distances. Therefore, the quality of compression depends on diversity rather than size. In most cases, IntZip is able to achieve much better compression results than general compression algorithms on such data sets. For example:

The complete list of all 65536 16-bit numbers has a trivial structure and can be encoded in just 6 bytes.
The list of all ~252K Unicode Code Points (32-bit numbers between 0 and 0x10ffff) is not structural trivial, but contains a lot of contiguous intervals. It can be compressed to a list of 700 integers, i.e. reduced by 99.7% (with GNU gzip one gets only up to ~50% compression when applied to the same data set).
There are 91 Fibonacci numbers greater than 1 fitting into 64 bit. Lists like this are hard to compress without knowledge of the special construction rule, because the numbers increase enormously fast. Here, IntZip achieves about 30% reduction by way of bit compression, wich is roughly the same reduction as obtained by general compression algorithms.