#14 CQPweb: N-grams

Milestone: TODO-3.6
Status: open
Labels: CQPweb (45)
Priority: 3
Updated: 2017-07-01
Created: 2009-06-15
Private: No

This would be a very nice additional feature (especially for researchers working on clusters/lexical bundles), if there is a way to do it without eating up oodles of disk space in the MySQL tables.

Discussion

  • Stephanie Evert

    Stephanie Evert - 2009-06-15

    N-grams are memory-hungry by default, but if you don't need direct links to every instance of an n-gram, you should take a closer look at the cwb-scan-corpus tool. It was designed for this purpose, and should be sufficient for smallish corpora (or subcorpora) of up to 50 M words. I've used it to calculate n-grams up to n=9 in the BNC on our server; with 16 GB RAM and not too many processes running at the same time, that's quite feasible.

    In particular, pay attention to the "-R" option, which allows you to calculate n-grams on arbitrary subcorpora (for a key cluster analysis).
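
    As a rough illustration of the suggestion above, here is a minimal Python sketch that only assembles a cwb-scan-corpus command line for n-gram counting; it does not run anything. The helper name scan_corpus_command and the file names are hypothetical. It assumes the usual key syntax of one "attribute+offset" key per n-gram slot, and that "-o", "-f" and "-R" take an output file, a frequency threshold and a file of corpus regions respectively; please check the cwb-scan-corpus man page of your CWB version before relying on this.

```python
def scan_corpus_command(corpus, n, attribute="word", output=None,
                        min_freq=None, regions_file=None):
    """Build (but do not run) an argv list for cwb-scan-corpus.

    Hypothetical helper: one key per n-gram slot, e.g. word+0 word+1 word+2.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    cmd = ["cwb-scan-corpus"]
    if output is not None:
        cmd += ["-o", output]            # assumed: write frequency table to file
    if min_freq is not None:
        cmd += ["-f", str(min_freq)]     # assumed: frequency threshold
    if regions_file is not None:
        cmd += ["-R", regions_file]      # assumed: restrict scan to listed regions
    cmd.append(corpus)
    # Offsets 0 .. n-1 relative to the current corpus position.
    cmd += [f"{attribute}+{i}" for i in range(n)]
    return cmd


if __name__ == "__main__":
    # Trigrams in a subcorpus; "subcorpus.txt" is a placeholder regions file.
    print(" ".join(scan_corpus_command("BNC", 3, output="trigrams.txt",
                                       regions_file="subcorpus.txt")))
```

    The resulting list can be passed to subprocess.run once the options have been verified; building the list separately keeps the corpus name and file names out of any shell quoting.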

    Displaying corpus examples of an n-gram will then require a separate CQP query, but with a little MU(...) trickery, this should still be fast enough for most users.
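
    One possible reading of the "MU(...) trickery" is a nested meet query, in which each further word is required at a fixed offset from the first word of the n-gram. The sketch below generates such a query string; the function name mu_ngram_query is hypothetical, and the sketch assumes that nested meet offsets are counted from the first token and that the words are plain alphanumeric strings (no regular-expression escaping is attempted). The generated query should be tested in CQP before being used in CQPweb.

```python
import re


def mu_ngram_query(words):
    """Return a CQP MU query string for a fixed sequence of word forms.

    Hypothetical helper: word i (counting from 0) must occur exactly
    i tokens to the right of the first word.
    """
    if len(words) < 2:
        raise ValueError("need at least two words for a meet query")
    for w in words:
        # Keep the sketch simple: refuse anything that would need escaping.
        if not re.fullmatch(r"[A-Za-z0-9]+", w):
            raise ValueError(f"unsupported characters in {w!r}")
    expr = f'"{words[0]}"'
    for offset, w in enumerate(words[1:], start=1):
        expr = f'(meet {expr} "{w}" {offset} {offset})'
    return f"MU{expr};"


if __name__ == "__main__":
    print(mu_ngram_query(["in", "due", "course"]))
```

    Under these assumptions the query matches only the first token of each n-gram, so the concordance display would need to extend the match by n-1 tokens to show the whole cluster.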

     
  • Andrew Hardie

    Andrew Hardie - 2017-07-01
    • Group: --> TODO-3.5
     
  • Andrew Hardie

    Andrew Hardie - 2017-07-01
    • Group: TODO-3.5 --> TODO-3.6
     
