We are happy to announce the availability of version 0.67 of the Ngram Statistics Package.
There are two interesting enhancements included in this version:
First, we have provided a utility program (huge-count.pl) that counts
bigrams in very large corpora. We have run it on hundreds of millions of words in less than an hour using a "typical" desktop computer.
The method it employs is simple: it breaks the corpus into some number of pieces, counts the bigrams in each piece, and then combines the counts.
In conjunction with this, we have modified statistic.pl so that it can process very large input files without difficulty. As a result, it is now
possible to count and carry out measures of association on very large corpora (such as the English GigaWord corpus) relatively quickly.
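
The split, count, and combine idea behind huge-count.pl can be illustrated with a short sketch. This is not the package's Perl code; it is a minimal Python illustration with hypothetical function names (count_bigrams, chunks, count_in_pieces). It assumes that bigrams never span a line break and it keeps everything in memory, whereas a real tool for very large corpora would presumably write the pieces to disk.

```python
from collections import Counter
from itertools import islice


def count_bigrams(lines):
    """Count adjacent token pairs within each line.

    Assumption for this sketch: no bigram spans a line break, so
    splitting the corpus between lines cannot lose or add any bigram.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def chunks(lines, size):
    """Yield successive lists of at most `size` lines."""
    it = iter(lines)
    while True:
        piece = list(islice(it, size))
        if not piece:
            return
        yield piece


def count_in_pieces(lines, size):
    """Count each piece separately, then add the per-piece counts together."""
    total = Counter()
    for piece in chunks(lines, size):
        total.update(count_bigrams(piece))  # Counter.update adds counts
    return total


corpus = [
    "the cat sat on the mat",
    "the cat ate",
    "on the mat the cat sat",
]
combined = count_in_pieces(corpus, size=2)
```

Under the line-boundary assumption, the combined counts are identical to counting the whole corpus in one pass, which is why the pieces can be processed independently.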
Second, we have rewritten the measure of association modules provided in the /Measures directory so that they share code. In previous versions there was considerable duplication of code between the measures (such as ll.pm, pmi.pm, and x2.pm). The common code for the 2d measures has been moved to /Measures/measure2d.pm, and for the 3d measures to /Measures/measure3d.pm.
This should make it much easier for users to write their own measures and to understand how the existing measures are implemented.
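
To show why shared code helps, here is a rough sketch of how several 2d measures can be built on one common routine. It is written in Python rather than Perl and does not reproduce the interface of measure2d.pm; the helper name observed_and_expected and the argument names are assumptions made for illustration. The formulas are the textbook forms of pointwise mutual information, the log-likelihood ratio, and Pearson's chi-squared for a 2x2 contingency table.

```python
import math


def observed_and_expected(n11, n1p, np1, npp):
    """Shared step (hypothetical helper): build the 2x2 table.

    n11: joint count of the bigram
    n1p: count of bigrams with the first word in first position
    np1: count of bigrams with the second word in second position
    npp: total number of bigrams
    Returns parallel lists of observed and expected cell values.
    """
    n2p = npp - n1p
    np2 = npp - np1
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [
        n1p * np1 / npp,
        n1p * np2 / npp,
        n2p * np1 / npp,
        n2p * np2 / npp,
    ]
    return observed, expected


def pmi(n11, n1p, np1, npp):
    """Pointwise mutual information: log2 of observed over expected n11."""
    obs, exp = observed_and_expected(n11, n1p, np1, npp)
    return math.log2(obs[0] / exp[0])


def ll(n11, n1p, np1, npp):
    """Log-likelihood ratio: 2 * sum of o * ln(o / e) over non-empty cells."""
    obs, exp = observed_and_expected(n11, n1p, np1, npp)
    return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)


def x2(n11, n1p, np1, npp):
    """Pearson's chi-squared: sum of (o - e)^2 / e over the four cells."""
    obs, exp = observed_and_expected(n11, n1p, np1, npp)
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))
```

With the table construction in one place, each measure reduces to a few lines, and a new measure only needs to supply its own formula.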