This would be a very nice additional feature (esp. for clusters/lexical bundles researchers), if there is a way to do it without eating up oodles of disk space in the MySQL tables.
N-grams are memory-hungry by default, but if you don't need direct links to every instance of an n-gram, you should take a closer look at the cwb-scan-corpus tool. It was designed for this purpose, and should be sufficient for smallish corpora (or subcorpora) of up to 50 M words. I've used it to calculate n-grams up to n=9 in the BNC on our server; with 16 GB RAM and not too many processes running at the same time, that's quite feasible.
In particular, pay attention to the "-R" option, which allows you to calculate n-grams on arbitrary subcorpora (for a key cluster analysis).
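To make this concrete, here is a sketch of what such an invocation might look like, assuming the usual `attribute+offset` key syntax; the output filenames, the frequency threshold, and the name of the range file are illustrative placeholders, not part of any existing setup:

```shell
# Count word trigrams over the whole BNC: each key names a positional
# attribute and a token offset, so word+0 word+1 word+2 scans 3-grams.
# -f 5 drops n-grams occurring fewer than 5 times, which keeps the
# output (and memory use) manageable; -o writes the frequency list.
cwb-scan-corpus -o trigrams.txt -f 5 BNC word+0 word+1 word+2

# For a key cluster analysis, restrict the scan to a subcorpus by
# passing a file of corpus-position ranges (e.g. obtained by dumping
# a named query result from CQP) via the -R option.
cwb-scan-corpus -R subcorpus.rng -o sub_trigrams.txt BNC word+0 word+1 word+2
```

The resulting frequency lists can then be compared (full corpus vs. subcorpus) to identify key clusters.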
Displaying corpus examples of an n-gram will then require a separate CQP query, but with a little MU(...) trickery, this should still be fast enough for most users.
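As a rough illustration of the kind of MU query meant here (following the meet/union syntax from the CQP tutorial; the corpus name and word forms are just examples), a trigram such as "kind of thing" can be located by nesting meet clauses, with offsets relative to the anchor token:

```
BNC;
Trigram = MU(meet (meet "kind" "of" 1 1) "thing" 2 2);
cat Trigram;
```

Because MU queries match only the anchor token, they avoid the overhead of a full sequence query, which is why this approach should stay responsive even on larger corpora.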