The ZIP archives in this directory contain source code only. Due to their size, the language models for reconstruction are located in the subdirectory LanguageModels, and the language identification models are not included with this release (get them from the Language-Aware Strings project).

RECENT CHANGES
==============

v1.00gamma 2013-05-07:

Hotspot optimization reduced reconstruction time by about 25%. Avoiding recomputation of n-gram scores during incremental updates when the original computation did not contribute to a wildcard's overall score increased the speed-up to 35% relative to v1.00beta. Changed scoring function to eliminate an exp() in the innermost loop, increasing the speed-up to 50+% relative to v1.00beta with virtually identical reconstruction accuracy.

Made "aggressive inference" (periodically assigning replacements for all wildcards with highly-skewed score distributions) the default, as it proved to improve both reconstruction accuracy and run time. Reversed the sense of the -r^ flag to allow the user to disable it.

Initial implementation of a word-length model for automatically detecting DEFLATE stream corruption; added -r:l flag to enable its use. This approach proved unsuccessful in detecting corruption.

Restored word-unigram model code from v0.9 and adapted it for use in detecting corruption; added -r:w to control its use.

Fixed segfault while verifying a candidate RAR header when the header-size field produces a header size which extends beyond the end of the input file.

Fixed test-mode reference matching to correctly handle a within-packet corruption when re-alignment across corruption is disabled.

Ensured proper display of multiple newlines in HTML mode.

v1.00beta 2013-02-13:

Initial implementation of first phase of packet-end recovery. Search proved to be intractable in the general case, but usable when the Huffman trees are known (e.g. corruption in the middle of a packet).

Implemented recovery of packets with corruption in the middle, including a search to re-align the decompressed data such that back-references across the corrupt region refer to the correct bytes.

Added handling of zlib-style sync/flush markers as additional headers for finding DEFLATEd data.

Refactored recovery code to use a list of DEFLATE packets, permitting multiple packets to contain corruption and enabling a user-specified corruption range in each packet.

Updated -t flag to permit an arbitrary range of up to 4096 bytes in the first packet to be designated as "corrupt" for testing purposes.

Tweaked HTML-mode output formatting and added a key to the start of the file to remind users of the color coding.

Switched storage of DecodedByte in files from three bytes to four bytes in preparation for extension of reconstruction code to other LZ77-based compression algorithms.

Extended search for reconstruction language models to look in the current directory, a "models" subdirectory, the directory containing the language identification database, and a system-wide directory, e.g. /usr/share/ziprec/.

Updated valgrind header files to valgrind-3.7.0.

Fixes for GCC 4.6.3 warnings.

Added scripts for running evaluations on Europarl corpus.

v1.00alpha 2012-04-03:

Complete re-write of reconstruction code, now using longer n-grams and eliminating the word-based reconstruction. This removes the need to have a word-splitter that works on any given character encoding and improves reconstruction of whitespace and punctuation. The new reconstruction method is also three to five times faster with the same or better accuracy.

Removed ziprec -r- option.
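
The zlib-style sync/flush markers mentioned under v1.00beta can be illustrated briefly. A zlib sync flush ends with an empty stored block whose last four bytes are 00 00 FF FF, so that byte pattern marks a point where DEFLATEd data may resume. The Python sketch below scans a buffer for the pattern. It is only an illustration, not the ziprec code; the function name find_sync_markers is hypothetical, and each hit is merely a candidate, since the same four bytes can occur by chance inside compressed data.

```python
import zlib

# Tail of the empty stored block that zlib emits on a sync flush.
SYNC_MARKER = b"\x00\x00\xff\xff"


def find_sync_markers(data):
    """Return the offset of every occurrence of SYNC_MARKER in data.

    Hypothetical helper for illustration; each offset is only a
    candidate position that a real tool would have to verify.
    """
    offsets = []
    pos = data.find(SYNC_MARKER)
    while pos != -1:
        offsets.append(pos)
        # Advance by one byte so that overlapping matches are not missed.
        pos = data.find(SYNC_MARKER, pos + 1)
    return offsets


def make_flushed_stream(payload):
    """Compress payload and end the output with a sync flush."""
    comp = zlib.compressobj()
    return comp.compress(payload) + comp.flush(zlib.Z_SYNC_FLUSH)
```

Calling find_sync_markers(make_flushed_stream(b"some text")) reports at least the marker that sits at the very end of the flushed stream.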
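
The language-model search locations listed under v1.00beta can be sketched as a simple ordered lookup. The Python below is a sketch under stated assumptions, not the ziprec implementation: the function names are hypothetical, the search order is assumed to match the order in which the changelog lists the locations, and the "models" subdirectory is assumed to sit under the current directory.

```python
import os


def candidate_model_dirs(langid_db_path, system_dir="/usr/share/ziprec"):
    """Return the four directories to search for a model file.

    Assumption: the order follows the changelog's wording; the real
    program may use a different order.
    """
    return [
        os.curdir,                                          # current directory
        os.path.join(os.curdir, "models"),                  # "models" subdirectory (assumed under cwd)
        os.path.dirname(os.path.abspath(langid_db_path)),   # directory of the language-ID database
        system_dir,                                         # system-wide directory
    ]


def find_model(filename, langid_db_path, system_dir="/usr/share/ziprec"):
    """Return the first existing path for filename, or None if not found."""
    for directory in candidate_model_dirs(langid_db_path, system_dir):
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            return path
    return None
```

Returning None rather than raising leaves it to the caller to decide whether a missing model is an error or simply disables reconstruction for that language.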
Source: README, updated 2013-05-09