The ZIP archives in this directory contain source code only. Due to
their size, the language models for reconstruction are located in the
subdirectory LanguageModels, and the language identification models
are not included with this release (they are available from the
Language-Aware Strings project,
https://sourceforge.net/projects/la-strings/files/).
RECENT CHANGES
==============
v1.00gamma 2013-05-07:
Hotspot optimization reduced reconstruction time by about 25%.
Avoiding recomputation of n-gram scores during incremental updates
when the original computation did not contribute to a wildcard's
overall score increased the speed-up to 35% relative to
v1.00beta.
Changed scoring function to eliminate an exp() in the innermost
loop, increasing the speed-up to more than 50% relative to
v1.00beta with virtually identical reconstruction accuracy.
Made "aggressive inference" (periodically assigning replacements
for all wildcards with highly skewed score distributions) the
default, as it improved both reconstruction accuracy and run
time. Reversed the sense of the -r^ flag so that it now disables
aggressive inference.
Initial implementation of a word-length model for automatically
detecting DEFLATE stream corruption; added -r:l flag to enable its
use. This approach proved unsuccessful in detecting corruption.
Restored word-unigram model code from v0.9 and adapted it for use
in detecting corruption; added -r:w to control its use.
Fixed segfault while verifying a candidate RAR header when the
header-size field produces a header size which extends beyond the
end of the input file.
Fixed test-mode reference matching to correctly handle a
within-packet corruption when re-alignment across corruption is
disabled.
Ensured proper display of multiple newlines in HTML mode.
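The RAR header fix listed above comes down to validating a size field
read from untrusted input before using it. The following is a minimal
sketch of such a check, not the actual ziprec code. The function name
headerFitsInInput is hypothetical, and the layout (a 7-byte fixed
header with a 16-bit little-endian size field at byte 5) is an
assumption based on the RAR 4.x block header format.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not taken from the ziprec sources.
// Returns true only if the header starting at `offset` lies entirely
// within an input buffer of `input_len` bytes.
bool headerFitsInInput(const std::uint8_t* input, std::size_t input_len,
                       std::size_t offset)
{
    const std::size_t kMinHeader = 7;  // assumed size of the fixed part
    // The fixed part must be readable before the size field is touched.
    if (offset > input_len || input_len - offset < kMinHeader)
        return false;
    // Assumed layout: 16-bit little-endian header size at bytes 5 and 6.
    std::size_t header_size =
        input[offset + 5] |
        (static_cast<std::size_t>(input[offset + 6]) << 8);
    if (header_size < kMinHeader)
        return false;
    // Compare against the bytes remaining so offset + size cannot overflow.
    return header_size <= input_len - offset;
}
```

A candidate header would be rejected whenever this returns false,
before any byte past the fixed part is read.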
v1.00beta 2013-02-13:
Initial implementation of first phase of packet-end recovery.
Search proved to be intractable in the general case, but usable
when the Huffman trees are known (e.g. corruption in the middle
of a packet).
Implemented recovery of packets with corruption in the middle,
including a search to re-align the decompressed data such that
back-references across the corrupt region refer to the correct
bytes.
Added handling of zlib-style sync/flush markers as additional
headers for finding DEFLATEd data.
Refactored recovery code to use a list of DEFLATE packets,
permitting multiple packets to contain corruption and enabling a
user-specified corruption range in each packet. Updated -t flag
to permit an arbitrary range of up to 4096 bytes in the first
packet to be designated as "corrupt" for testing purposes.
Tweaked HTML-mode output formatting and added a key to the start of
the file to remind users of the color coding.
Switched storage of DecodedByte in files from three bytes to four
bytes in preparation for extension of reconstruction code to
other LZ77-based compression algorithms.
Extended search for reconstruction language models to look in the
current directory, a "models" subdirectory, the directory
containing the language identification database, and a
system-wide directory, e.g. /usr/share/ziprec/.
Updated valgrind header files to valgrind-3.7.0.
Fixes for GCC 4.6.3 warnings.
Added scripts for running evaluations on Europarl corpus.
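As background for the zlib-style sync/flush markers mentioned above:
a sync or full flush in zlib ends with an empty stored block, whose
last four bytes are 00 00 FF FF. The following is a minimal sketch of
scanning for that byte pattern, not the actual ziprec detection code;
the function name findSyncMarkers is hypothetical. The pattern can
also occur by chance, so each hit is only a candidate that needs
further validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not taken from the ziprec sources.
// Returns the offsets just past each 00 00 FF FF sequence, i.e. the
// positions where a new DEFLATE block could begin after a flush.
std::vector<std::size_t> findSyncMarkers(const std::uint8_t* data,
                                         std::size_t len)
{
    static const std::uint8_t marker[4] = { 0x00, 0x00, 0xFF, 0xFF };
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i + 4 <= len; ++i) {
        if (data[i] == marker[0] && data[i + 1] == marker[1] &&
            data[i + 2] == marker[2] && data[i + 3] == marker[3])
            hits.push_back(i + 4);  // candidate start of the next block
    }
    return hits;
}
```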
v1.00alpha 2012-04-03:
Complete re-write of reconstruction code, now using longer n-grams
and eliminating the word-based reconstruction. This removes the
need to have a word-splitter that works on any given character
encoding and improves reconstruction of whitespace and
punctuation. The new reconstruction method is also three to five
times faster with the same or better accuracy.
Removed ziprec -r- option.