Failed to generate a valid language model

Yueyu/Lin
2008-09-07
2012-09-22
  • Yueyu/Lin

    Yueyu/Lin - 2008-09-07

    I successfully built and ran the confidence demo.
    Now I want to use my own transcriptions to generate the language model.
    My transcription file is named "hello.transcript". The contents are:

    Sometimes I think it is good
    I disagree it

    Then I use the CMU_Cam_Toolkit to generate my own language model using the following scripts:

    cat hello.transcript | ./text2wfreq.exe | ./wfreq2vocab.exe > hello.vocab
    cat hello.transcript | ./text2idngram.exe -n 3 -vocab hello.vocab | ./idngram2lm.exe -absolute -n 3 -vocab hello.vocab -idngram - -arpa hello.lm

    The contents of the generated language model "hello.lm" are:

    Ronald Rosenfeld and Philip Clarkson

    =============================================================================
    =============== This file was produced by the CMU-Cambridge ===============
    =============== Statistical Language Modeling Toolkit ===============
    =============================================================================
    This is a 3-gram language model, based on a vocabulary of 7 words,
    which begins "I", "Sometimes", "disagree"...
    This is an OPEN-vocabulary model (type 1)
    (OOVs were mapped to UNK, which is treated as any other vocabulary word)
    Absolute discounting was applied.
    1-gram discounting constant : 0.714286
    2-gram discounting constant : 1
    3-gram discounting constant : 1
    This file is in the ARPA-standard format introduced by Doug Paul.

    p(wd3|wd1,wd2)= if(trigram exists) p_3(wd1,wd2,wd3)
    else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
    else p(wd3|w2)

    p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
    else bo_wt_1(wd1)*p_1(wd2)

    All probs and back-off weights (bo_wt) are given in log10 form.

    Data formats:

    Beginning of data mark: \data\
    ngram 1=nr # number of 1-grams
    ngram 2=nr # number of 2-grams
    ngram 3=nr # number of 3-grams

    \1-grams:
    p_1 wd_1 bo_wt_1
    \2-grams:
    p_2 wd_1 wd_2 bo_wt_2
    \3-grams:
    p_3 wd_1 wd_2 wd_3

    end of data mark: \end\

    \data\
    ngram 1=8
    ngram 2=7
    ngram 3=7

    \1-grams:
    -1.0607 <UNK> 0.0000
    -0.4075 I 0.0830
    -1.0607 Sometimes 0.2156
    -1.0607 disagree 0.0000
    -1.0607 good 0.2156
    -1.0607 is 0.0395
    -1.0607 it 0.0395
    -1.0607 think 0.0395

    \2-grams:
    -99.9990 I disagree 0.0395
    -99.9990 I think 0.0000
    -99.9990 Sometimes I 0.0000
    -99.9990 good I 0.0000
    -99.9990 is good 0.0000
    -99.9990 it is 0.0000
    -99.9990 think it 0.0000

    \3-grams:
    -99.9990 I disagree it
    -99.9990 I think it
    -99.9990 Sometimes I think
    -99.9990 good I disagree
    -99.9990 is good I
    -99.9990 it is good
    -99.9990 think it is

    \end\
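    To make the back-off arithmetic in the file's header concrete: for a word pair this tiny corpus never produced, say p(good | Sometimes), the bigram is missing, so the formula falls back to 10^(bo_wt_1(Sometimes) + p_1(wd2)). The choice of word pair here is just an illustration; the log10 values come from the 1-gram table above:

    ```shell
    # Back-off estimate for p(good | Sometimes): the bigram "Sometimes good"
    # is not in the model, so per the header formula the probability is
    # 10^(bo_wt_1(Sometimes) + p_1(good)), with bo_wt_1(Sometimes) = 0.2156
    # and p_1(good) = -1.0607 taken from the \1-grams section.
    awk 'BEGIN { printf "%.3f\n", 10^(0.2156 - 1.0607) }'
    # prints 0.143 (about 1/7, i.e. near-uniform over the seven vocabulary words)
    ```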

    Then I use "hello.lm" in place of "confidence.trigram.lm" in the confidence demo's configuration file.
    But I get the following errors and the application terminates.

    01:18.031 WARNING dictionary Missing word: <unk>
    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
    01:18.249 WARNING dictionary Missing word: <unk>
    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
    Exception in thread "main" java.lang.NullPointerException
    at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.getInitialSearchState(LexTreeLinguist.java:461)
    at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.compileGrammar(LexTreeLinguist.java:487)
    at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.allocate(LexTreeLinguist.java:406)
    at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.allocate(WordPruningBreadthFirstSearchManager.java:323)
    at edu.cmu.sphinx.decoder.Decoder.allocate(Decoder.java:109)
    at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:182)
    at Confidence.main(Confidence.java:60)
    Java Result: 1

    When I switch back to the original LM file, it works fine.
    I must have made some mistake generating the language model, but I just followed the instructions. Can someone help point out where I went wrong?
    Thanks in advance.

     
    • Yueyu/Lin

      Yueyu/Lin - 2008-09-08

      Thanks a lot!
      But I still have a problem when using the quick_lm Perl script.
      I always get a blank entry in the 1-grams, such as:

      \1-grams:
      -0.9542 -0.3010
      -1.2553 </s> -0.2499
      -1.2553 <s> -0.2499

      It doesn't matter much, since I can remove it manually, but is it a small bug in the script?

       
      • Nickolay V. Shmyrev

        I don't get it; are you sure you cleaned up the text properly? Probably something like a carriage return was left in.
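        One way to check for (and strip) stray carriage returns before feeding the text to quick_lm; "corpus.txt" is a placeholder file name:

        ```shell
        # Count lines containing a carriage return (non-zero means DOS line endings):
        grep -c "$(printf '\r')" corpus.txt
        # Strip the carriage returns and write a clean copy:
        tr -d '\r' < corpus.txt > corpus.clean.txt
        ```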

         
    • Nickolay V. Shmyrev

      > Then I use the CMU_Cam_Toolkit to generate my own language model using the following scripts:

      You are using a rather old package.

      > I must have made some mistake generating the language model, but I just followed the instructions. Can someone help point out where I went wrong?

      a) Your model has no <s> and </s> marks; they are required. You need to add <s> and </s> cues to the text before processing it.
      b) Your model has an <unk> tag; use -vocab_type 0 to generate a closed vocabulary.
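      A sketch of the fix for (a): wrap each line of the corpus in <s> … </s> before running the toolkit. Filenames follow the original post; the toolkit invocations are shown as comments since they depend on the CMU-Cam binaries being present:

      ```shell
      # Add sentence-start/end cues to every utterance:
      sed -e 's/^/<s> /' -e 's|$| </s>|' hello.transcript > hello.tagged
      # Then rebuild the model from the tagged text, asking idngram2lm for a
      # closed vocabulary (-vocab_type 0) so no <UNK> entry is produced:
      #   cat hello.tagged | ./text2wfreq.exe | ./wfreq2vocab.exe > hello.vocab
      #   cat hello.tagged | ./text2idngram.exe -n 3 -vocab hello.vocab | \
      #       ./idngram2lm.exe -n 3 -vocab hello.vocab -vocab_type 0 -idngram - -arpa hello.lm
      ```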

      There is no ready-to-use recipe to get a model right now; the easiest way is to use the quick_lm Perl script or the online lmtool. See also

      http://www.voxforge.org/home/forums/message-boards/speech-recognition-engines/cmuslm-_arpa?pn=2
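      Once a new model is built, one quick sanity check (a sketch, assuming the ARPA layout shown in the original post; "hello.lm" is the file name from there) is to confirm that each \N-grams: section really lists as many entries as the \data\ header declares:

      ```shell
      # Compare the ngram counts declared after \data\ with the number of
      # entries actually listed in each \N-grams: section of an ARPA file.
      awk '
        /^\\data\\/                  { started = 1; next }   # real header reached
        started && /^ngram [0-9]+=/  { split($2, kv, "="); want[kv[1] + 0] = kv[2] + 0; next }
        started && /^\\[0-9]-grams:/ { sec = substr($0, 2, 1) + 0; next }  # section start
        started && /^\\end\\/        { sec = 0 }             # stop counting
        sec && NF                    { got[sec]++ }          # one entry per line
        END { for (n in want) printf "%d-grams: declared=%d listed=%d\n", n, want[n] + 0, got[n] + 0 }
      ' hello.lm
      ```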

       
