
Creating a smaller size language model

dovark
2013-09-17
2013-09-17
  • dovark

    dovark - 2013-09-17

    Hello,

    I have a database with 1 trillion words (http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2006T13). I want to create a smaller language model from it, with a vocabulary of 50k words.

    The method I'm currently using is to select the top 50k words by frequency, delete every n-gram that contains ANY word outside that set, and then create the language model from the remaining n-grams.

    Is this a sensible method? Are there any better methods to do this task?
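
The filtering step described in this post could be sketched roughly as follows. This is only a minimal illustration, not the poster's actual code: it assumes each n-gram line has the form `w1 w2 ... wn<TAB>count`, and the function names `top_k_vocab` and `filter_ngrams` are hypothetical.

```python
import heapq
from typing import Iterable, Iterator, Set, Tuple


def top_k_vocab(unigram_counts: Iterable[Tuple[str, int]], k: int) -> Set[str]:
    """Return the k most frequent words from (word, count) pairs."""
    best = heapq.nlargest(k, unigram_counts, key=lambda wc: wc[1])
    return {word for word, _ in best}


def filter_ngrams(lines: Iterable[str], vocab: Set[str]) -> Iterator[str]:
    """Yield only the n-gram lines whose tokens are all in vocab.

    Assumed line format: space-separated tokens, a tab, then the count.
    """
    for line in lines:
        ngram, sep, count = line.rstrip("\n").rpartition("\t")
        if not sep:
            continue  # no tab found: skip the malformed line
        if all(token in vocab for token in ngram.split(" ")):
            yield f"{ngram}\t{count}"


# Toy data standing in for the real unigram and n-gram files.
unigrams = [("the", 100), ("cat", 50), ("sat", 40), ("zyzzyva", 1)]
vocab = top_k_vocab(unigrams, 3)
ngram_lines = ["the cat\t30\n", "cat zyzzyva\t1\n", "the cat sat\t12\n"]
kept = list(filter_ngrams(ngram_lines, vocab))
```

Both functions stream over their input, so the n-gram files never need to fit in memory. If the data contains sentence-boundary or unknown-word marker tokens, they would probably need to be added to `vocab` by hand so that n-grams containing them are not dropped.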

     
  • Nickolay V. Shmyrev

    Is this a sensible method?

    Yes

     
