IMS Open Corpus Workbench / Feature Requests / #14 CQPweb: N-grams

Indexing and query tools for very large text corpora

#14 CQPweb: N-grams

Milestone: TODO-3.6

Status: open

Owner: Andrew Hardie

Labels: CQPweb (45)

Priority: 3

Updated: 2017-07-01

Created: 2009-06-15

Creator: Andrew Hardie

Private: No

This would be a very nice additional feature (esp. for the clusters/lexical bundles researchers), if there is a way to do it without eating up oodles of disk space on the MySQL tables.

Discussion

Stephanie Evert - 2009-06-15

N-grams are memory-hungry by default, but if you don't need direct links to every instance of an n-gram, you should take a closer look at the cwb-scan-corpus tool. It was designed for this purpose, and should be sufficient for smallish corpora (or subcorpora) of up to 50 M words. I've used it to calculate n-grams up to n=9 in the BNC on our server; with 16 GB RAM and not too many processes running at the same time, that's quite feasible.
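As a rough illustration of how such a call might be assembled from a script, here is a minimal Python sketch. The helper name `build_scan_command` is hypothetical. The key syntax (`word+0`, `word+1`, ...) and the `-f` (frequency threshold) and `-o` (output file) options are written from memory of the tool's usage message, so check them against `cwb-scan-corpus -h` on your installation before relying on them.

```python
def build_scan_command(corpus, n, attribute="word", min_freq=None, output=None):
    """Return an argv list asking cwb-scan-corpus for n-gram frequencies.

    Hypothetical helper: the option letters (-f, -o) and the key syntax
    (attribute+offset) are assumptions to verify with `cwb-scan-corpus -h`.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    argv = ["cwb-scan-corpus"]
    if min_freq is not None:
        argv += ["-f", str(min_freq)]   # assumed: keep only items with freq >= min_freq
    if output is not None:
        argv += ["-o", output]          # assumed: write the frequency table to this file
    argv.append(corpus)
    # One key per n-gram slot: word+0, word+1, ..., word+(n-1)
    argv += [f"{attribute}+{i}" for i in range(n)]
    return argv


if __name__ == "__main__":
    # Prints the command line; pass the list to subprocess.run() to execute it.
    print(" ".join(build_scan_command("BNC", 3, min_freq=5, output="trigrams.tbl")))
```

Building the argument list separately from running it keeps the sketch usable on a machine without CWB installed.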

In particular, pay attention to the "-R" option, which allows you to calculate n-grams on arbitrary subcorpora (for a key cluster analysis).
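To make the idea of counting n-grams only inside selected regions concrete, here is a small Python sketch. It is an illustration of the concept, not how cwb-scan-corpus is implemented. The function name `ngram_counts` and the representation of regions as inclusive `(start, end)` token positions are assumptions made for this example.

```python
from collections import Counter


def ngram_counts(tokens, n, regions=None):
    """Count n-grams in `tokens`, optionally only inside the given regions.

    `regions` is a list of inclusive (start, end) token positions (an
    assumption for this sketch). N-grams never cross a region boundary.
    Only counts are stored, not the positions of individual instances.
    """
    if regions is None:
        regions = [(0, len(tokens) - 1)]
    counts = Counter()
    for start, end in regions:
        # Last valid start position is end - n + 1, so the n-gram fits in the region.
        for i in range(start, end - n + 2):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    text = "a b a b c a b".split()
    whole = ngram_counts(text, 2)
    sub = ngram_counts(text, 2, regions=[(0, 3), (5, 6)])
    print(whole[("a", "b")], sub[("a", "b")], sub[("b", "c")])
```

For a key cluster analysis, the counts from the subcorpus regions would then be compared against the counts from the whole corpus with a keyness statistic of your choice.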

Displaying corpus examples of an n-gram will then require a separate CQP query, but with a little MU(...) trickery, this should still be fast enough for most users.
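One way the MU trickery could look is a nested `meet` query generated from the words of the n-gram, in the style of the `MU(meet (meet "in" "due" 1 1) "course" 2 2);` pattern from the CQP tutorial. The Python sketch below builds such a query string. The helper names are hypothetical, and the backslash escaping of regular-expression metacharacters and double quotes is an assumption about CQP string syntax that should be tested against your CQP version.

```python
# Characters escaped with a backslash; assumed to be sufficient for CQP strings.
CQP_SPECIAL = set(r'\.^$*+?()[]{}|"')


def escape_cqp(word):
    """Backslash-escape regex metacharacters so the word is matched literally."""
    return "".join("\\" + ch if ch in CQP_SPECIAL else ch for ch in word)


def mu_query(words):
    """Build a nested MU(meet ...) query for a sequence of adjacent words.

    Each further word is required at a fixed offset from the first word.
    """
    if len(words) < 2:
        raise ValueError("an MU meet query needs at least two words")
    query = f'"{escape_cqp(words[0])}"'
    for offset, word in enumerate(words[1:], start=1):
        query = f'(meet {query} "{escape_cqp(word)}" {offset} {offset})'
    return f"MU{query};"


if __name__ == "__main__":
    print(mu_query(["in", "due", "course"]))
```

As far as I recall, the result of such a query marks only the position of the first word, so the display code would need to extend each match by n-1 tokens to show the full n-gram.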


Andrew Hardie - 2017-07-01

Group: --> TODO-3.5

Andrew Hardie - 2017-07-01

Group: TODO-3.5 --> TODO-3.6