This would be a very nice additional feature (esp. for clusters/lexical bundles researchers), if there is a way to do it without eating up oodles of disk space in the MySQL tables.
N-grams are memory-hungry by default, but if you don't need direct links to every instance of an n-gram, you should take a closer look at the cwb-scan-corpus tool. It was designed for this purpose, and should be sufficient for smallish corpora (or subcorpora) of up to 50 M words. I've used it to calculate n-grams up to n=9 in the BNC on our server; with 16 GB RAM and not too many processes running at the same time, that's quite feasible.
In particular, pay attention to the "-R" option, which allows you to calculate n-grams on arbitrary subcorpora (for a key cluster analysis).
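To make this concrete, here is a sketch of what such an invocation might look like, assuming the usual `attribute+offset` key syntax; the output filenames, the frequency threshold, and the name of the range file are illustrative placeholders, not part of any existing setup:

```shell
# Count word trigrams over the whole BNC: each key names a positional
# attribute and a token offset, so word+0 word+1 word+2 scans 3-grams.
# -f 5 drops n-grams occurring fewer than 5 times, which keeps the
# output (and memory use) manageable; -o writes the frequency list.
cwb-scan-corpus -o trigrams.txt -f 5 BNC word+0 word+1 word+2

# For a key cluster analysis, restrict the scan to a subcorpus by
# passing a file of corpus-position ranges (e.g. obtained by dumping
# a named query result from CQP) via the -R option.
cwb-scan-corpus -R subcorpus.rng -o sub_trigrams.txt BNC word+0 word+1 word+2
```

The resulting frequency lists can then be compared (full corpus vs. subcorpus) to identify key clusters.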
Displaying corpus examples of an n-gram will then require a separate CQP query, but with a little MU(...) trickery, this should still be fast enough for most users.
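As a rough illustration of the kind of MU query meant here (following the meet/union syntax from the CQP tutorial; the corpus name and word forms are just examples), a trigram such as "kind of thing" can be located by nesting meet clauses, with offsets relative to the anchor token:

```
BNC;
Trigram = MU(meet (meet "kind" "of" 1 1) "thing" 2 2);
cat Trigram;
```

Because MU queries match only the anchor token, they avoid the overhead of a full sequence query, which is why this approach should stay responsive even on larger corpora.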