Stefan Evert - 2009-06-15

N-grams are memory-hungry by default, but if you don't need direct links to every instance of an n-gram, you should take a closer look at the cwb-scan-corpus tool. It was designed for this purpose, and should be sufficient for smallish corpora (or subcorpora) of up to 50 M words. I've used it to calculate n-grams up to n=9 in the BNC on our server; with 16 GB RAM and not too many processes running at the same time, that's quite feasible.

In particular, pay attention to the "-R" option, which allows you to calculate n-grams on arbitrary subcorpora (for a key cluster analysis).

Displaying corpus examples of an n-gram will then require a separate CQP query, but with a little MU(...) trickery, this should still be fast enough for most users.