
Differences between the language models generated by the online and offline cmuclm toolkit

2015-12-15
  • Toan Nguyen

    Toan Nguyen - 2015-12-15

    Dear cmusphinx team,

    I have used both the online cmuclm toolkit
    (http://www.speech.cs.cmu.edu/tools/lmtool-new.html) and the offline one
    (https://sourceforge.net/p/cmusphinx/code/HEAD/tree/trunk/cmuclmtk/) to
    build language model files from the file test.txt attached to this post.

    However, there are a lot of differences between the two generated language
    model files, online_tool.lm and offline_tool respectively.

    Moreover, the model generated by the online tool gives much more accurate
    results when used in the Android PocketSphinx app, while the other one
    always recognises incorrectly.

    Could you tell me how to use the offline tool to build a language model
    like the one the online tool produces?

    Below are the command lines I used with the offline tool:

    text2wfreq < test.txt | wfreq2vocab > test.vocab

    text2idngram -vocab test.vocab -idngram test.idngram < test.txt

    idngram2lm -vocab_type 0 -idngram test.idngram -vocab test.vocab -arpa test.lm

    Please review them and tell me which parameters I should add to the three
    command lines above.

    Yours sincerely, Toan

     
    • Nickolay V. Shmyrev

      idngram2lm uses kndiscount by default, while the online tool uses absolute discounting. You can get absolute discounting from idngram2lm by passing the -absolute option.

      You can also use a Perl script to train models identical to the online models:

      http://www.speech.cs.cmu.edu/tools/download/quick_lm.pl
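
For reference, the suggestion above can be sketched as the original three commands with -absolute added to the idngram2lm step. This is only a sketch: the file names are the ones from the question, the helper name lm_commands is made up for illustration, and the function only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch only: prints the three cmuclmtk commands from the question,
# with -absolute added to idngram2lm as suggested in the reply.
# lm_commands is a hypothetical helper name, not part of cmuclmtk.

lm_commands() {
    corpus=$1                 # e.g. test.txt
    base=${corpus%.txt}       # strip the extension: test
    printf '%s\n' \
        "text2wfreq < $corpus | wfreq2vocab > $base.vocab" \
        "text2idngram -vocab $base.vocab -idngram $base.idngram < $corpus" \
        "idngram2lm -vocab_type 0 -idngram $base.idngram -vocab $base.vocab -arpa $base.lm -absolute"
}

# Show the commands for the corpus used in the question.
lm_commands test.txt
```

To actually build the model, the printed lines can be piped to a shell (lm_commands test.txt | sh), assuming the cmuclmtk binaries are on PATH. Whether the result matches the online model exactly is not guaranteed by this sketch; the Perl script linked above is the route described as producing identical models.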

       
      • Toan Nguyen

        Toan Nguyen - 2015-12-15

        ok, thanks!

         
