CMU Sphinx / Forums / Help: The difference of the generated language between cmuclm toolkit online and offline

Forum: Help

Creator: Toan Nguyen

Created: 2015-12-15

Updated: 2015-12-15

Toan Nguyen - 2015-12-15

Dear cmusphinx team,

I have used the cmuclm toolkit both online
(http://www.speech.cs.cmu.edu/tools/lmtool-new.html) and offline
(https://sourceforge.net/p/cmusphinx/code/HEAD/tree/trunk/cmuclmtk/) to
build language model files from the file test.txt attached to this post.

However, there are a lot of differences between the two generated language
model files, online_tool.lm and offline_tool respectively.

Moreover, the model generated by the online tool gives much more accurate
results when used in the Android PocketSphinx app, while the other one
always recognises incorrectly.

Could you tell me how to use the offline tool to build a language model
like the one the online tool produces?

Below are the command lines I used with the offline tool:

text2wfreq < test.txt | wfreq2vocab > test.vocab

text2idngram -vocab test.vocab -idngram test.idngram < test.txt

idngram2lm -vocab_type 0 -idngram test.idngram -vocab test.vocab -arpa test.lm

Please review them and tell me which parameters I should add to the three
command lines above.

Yours sincerely, Toan

- Nickolay V. Shmyrev - 2015-12-15
  
  idngram2lm uses kndiscount by default, whereas the online tool uses absolute discounting. You can get absolute discounting from idngram2lm by adding the -absolute option.
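
As a rough sketch of that suggestion, the three commands from the question could be wrapped in a small script with -absolute added to the idngram2lm step. The file names are the ones used in the question; the variable names and the build_lm function are only illustrative, and the resulting model has not been checked against the online tool's output.

```shell
#!/bin/sh
# Sketch only: the question's pipeline with -absolute added to idngram2lm.
# File names come from the question; variable and function names are illustrative.

CORPUS=test.txt        # training sentences
VOCAB=test.vocab       # vocabulary written by wfreq2vocab
IDNGRAM=test.idngram   # id n-gram file written by text2idngram
LM=test.lm             # ARPA language model written by idngram2lm

build_lm() {
    # Step 1: count word frequencies and derive the vocabulary.
    text2wfreq < "$CORPUS" | wfreq2vocab > "$VOCAB" &&
    # Step 2: convert the corpus to id n-grams using that vocabulary.
    text2idngram -vocab "$VOCAB" -idngram "$IDNGRAM" < "$CORPUS" &&
    # Step 3: write the ARPA model, requesting absolute discounting.
    idngram2lm -vocab_type 0 -idngram "$IDNGRAM" -vocab "$VOCAB" -absolute -arpa "$LM"
}

# Only run the pipeline when the cmuclmtk binaries are actually installed.
if command -v idngram2lm >/dev/null 2>&1; then
    build_lm
else
    echo "cmuclmtk tools not found on PATH; nothing was built" >&2
fi
```

The only change from the original commands is the -absolute flag in the last step; the first two steps are unchanged.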
  
  You can also use this Perl script to train models identical to the online models:
  
  http://www.speech.cs.cmu.edu/tools/download/quick_lm.pl
  
  - Toan Nguyen - 2015-12-15
    
    ok, thanks!
    