I have a corpus of 1 trillion words (http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2006T13), and I want to build a smaller language model (50k vocabulary) from it.
The method I'm currently using is to select the top 50k words by frequency, delete every n-gram that contains ANY out-of-vocabulary word, and then build the language model from the remaining n-grams.
Is this a sensible method? Are there any better methods to do this task?
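The filtering step described above can be sketched as follows. This is a toy illustration with hypothetical helper names, not your actual pipeline; at 1T-word scale you would use a toolkit rather than plain Python dictionaries.

```python
from collections import Counter


def top_k_vocab(unigram_counts, k=50_000):
    """Select the k most frequent words as the vocabulary."""
    return {w for w, _ in Counter(unigram_counts).most_common(k)}


def filter_ngrams(ngram_counts, vocab):
    """Keep only n-grams in which every word is in-vocabulary."""
    return {ng: c for ng, c in ngram_counts.items()
            if all(w in vocab for w in ng)}


# Toy data: counts are made up for illustration.
unigrams = {"the": 100, "cat": 50, "sat": 40, "zyzzyva": 1}
vocab = top_k_vocab(unigrams, k=3)          # {"the", "cat", "sat"}

bigrams = {("the", "cat"): 30, ("cat", "sat"): 20, ("the", "zyzzyva"): 1}
kept = filter_ngrams(bigrams, vocab)        # drops ("the", "zyzzyva")
```

An alternative worth knowing about: instead of deleting out-of-vocabulary n-grams outright, many toolkits map OOV words to an `<unk>` token and keep the counts, which preserves probability mass rather than discarding it.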
Yes, that is a sensible approach.