I have a corpus of 1 trillion words (http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2006T13), and I want to build a smaller language model (50k vocabulary) from it.
The method I'm currently using is to select the top 50k words by frequency, delete every n-gram that contains ANY out-of-vocabulary word, and then build the language model from the remaining n-grams.
Is this a sensible method? Are there any better methods to do this task?
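The filtering step described above can be sketched as follows. This is a toy illustration with hypothetical helper names, not your actual pipeline; at 1T-word scale you would use a toolkit rather than plain Python dictionaries.

```python
from collections import Counter


def top_k_vocab(unigram_counts, k=50_000):
    """Select the k most frequent words as the vocabulary."""
    return {w for w, _ in Counter(unigram_counts).most_common(k)}


def filter_ngrams(ngram_counts, vocab):
    """Keep only n-grams in which every word is in-vocabulary."""
    return {ng: c for ng, c in ngram_counts.items()
            if all(w in vocab for w in ng)}


# Toy data: counts are made up for illustration.
unigrams = {"the": 100, "cat": 50, "sat": 40, "zyzzyva": 1}
vocab = top_k_vocab(unigrams, k=3)          # {"the", "cat", "sat"}

bigrams = {("the", "cat"): 30, ("cat", "sat"): 20, ("the", "zyzzyva"): 1}
kept = filter_ngrams(bigrams, vocab)        # drops ("the", "zyzzyva")
```

An alternative worth knowing about: instead of deleting out-of-vocabulary n-grams outright, many toolkits map OOV words to an `<unk>` token and keep the counts, which preserves probability mass rather than discarding it.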
Yes, that is a sensible approach.