#14 CQPweb: N-grams

Andrew Hardie

This would be a very nice additional feature (especially for clusters/lexical-bundles researchers), if there is a way to do it without eating up oodles of disk space in the MySQL tables.


  • Stefan Evert

    N-grams are memory-hungry by default, but if you don't need direct links to every instance of an n-gram, you should take a closer look at the cwb-scan-corpus tool. It was designed for this purpose, and should be sufficient for smallish corpora (or subcorpora) of up to 50 M words. I've used it to calculate n-grams up to n=9 in the BNC on our server; with 16 GB RAM and not too many processes running at the same time, that's quite feasible.
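    For illustration, a whole-corpus trigram count might look like the sketch below. The corpus name, output file and frequency threshold are just examples, so check `cwb-scan-corpus -h` for the exact options supported by your CWB version:

    ```
    # count all trigrams in the BNC, writing (frequency, w1, w2, w3) rows
    # to a file; -f drops n-grams below the given frequency threshold
    cwb-scan-corpus -f 5 -o bnc-trigrams.txt BNC word+0 word+1 word+2
    ```

    The `att+offset` key syntax generalises to any n: adding `word+3`, `word+4`, etc. extends the scan to 4-grams, 5-grams and so on, at the cost of a correspondingly larger hash table in memory.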

    In particular, pay attention to the "-R" option, which allows you to calculate n-grams on arbitrary subcorpora (for a key cluster analysis).
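    As a sketch (file names hypothetical), assuming "-R" accepts a file of corpus ranges in CQP's dump/undump format, the workflow might be: save the subcorpus ranges from CQP, then restrict the scan to them:

    ```
    # inside CQP: run a query defining the subcorpus, then save its ranges
    #   BNC> Sub = <your query>; dump Sub > "sub.dump";

    # then count bigrams only within those ranges
    cwb-scan-corpus -R sub.dump -f 3 BNC word+0 word+1
    ```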

    Displaying corpus examples of an n-gram will then require a separate CQP query, but with a little MU(...) trickery, this should still be fast enough for most users.
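    One form that MU(...) trickery could take, shown here for the (purely illustrative) trigram "in terms of": nest meet clauses so that each word is constrained to the right offset from the first. In a meet clause, `(meet A B n m)` matches positions of A that have a match of B at offset n to m:

    ```
    # CQP query for instances of the trigram "in terms of":
    # inner meet anchors "terms" at offset 1 from "in",
    # outer meet anchors "of" at offset 2 from "in"
    MU(meet (meet "in" "terms" 1 1) "of" 2 2);
    ```

    Because MU queries are evaluated directly from the index, this should retrieve the instances of a given n-gram much faster than a naive standard query over the whole corpus.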