Failed to generate a valid language model

Yueyu/Lin
2008-09-07
2012-09-22
  • Yueyu/Lin

    Yueyu/Lin - 2008-09-07

    I successfully built and ran the confidence demo.
    Now I want to use my own transcriptions to generate the language model.
    My transcription file is named "hello.transcript". The contents are:

    Sometimes I think it is good
    I disagree it

    Then I use the CMU_Cam_Toolkit to generate my own language model using the following scripts:

    cat hello.transcript | ./text2wfreq.exe | ./wfreq2vocab.exe > hello.vocab
    cat hello.transcript | ./text2idngram.exe -n 3 -vocab hello.vocab | ./idngram2lm.exe -absolute -n 3 -vocab hello.vocab -idngram - -arpa hello.lm

    The contents of the generated language model "hello.lm" are:

    Ronald Rosenfeld and Philip Clarkson

    =============================================================================
    =============== This file was produced by the CMU-Cambridge ===============
    =============== Statistical Language Modeling Toolkit ===============
    =============================================================================
    This is a 3-gram language model, based on a vocabulary of 7 words,
    which begins "I", "Sometimes", "disagree"...
    This is an OPEN-vocabulary model (type 1)
    (OOVs were mapped to UNK, which is treated as any other vocabulary word)
    Absolute discounting was applied.
    1-gram discounting constant : 0.714286
    2-gram discounting constant : 1
    3-gram discounting constant : 1
    This file is in the ARPA-standard format introduced by Doug Paul.

    p(wd3|wd1,wd2)= if(trigram exists) p_3(wd1,wd2,wd3)
    else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
    else p(wd3|w2)

    p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
    else bo_wt_1(wd1)*p_1(wd2)

    All probs and back-off weights (bo_wt) are given in log10 form.

    Data formats:

    Beginning of data mark: \data\
    ngram 1=nr # number of 1-grams
    ngram 2=nr # number of 2-grams
    ngram 3=nr # number of 3-grams

    \1-grams:
    p_1 wd_1 bo_wt_1
    \2-grams:
    p_2 wd_1 wd_2 bo_wt_2
    \3-grams:
    p_3 wd_1 wd_2 wd_3

    end of data mark: \end\

    \data\
    ngram 1=8
    ngram 2=7
    ngram 3=7

    \1-grams:
    -1.0607 <UNK> 0.0000
    -0.4075 I 0.0830
    -1.0607 Sometimes 0.2156
    -1.0607 disagree 0.0000
    -1.0607 good 0.2156
    -1.0607 is 0.0395
    -1.0607 it 0.0395
    -1.0607 think 0.0395

    \2-grams:
    -99.9990 I disagree 0.0395
    -99.9990 I think 0.0000
    -99.9990 Sometimes I 0.0000
    -99.9990 good I 0.0000
    -99.9990 is good 0.0000
    -99.9990 it is 0.0000
    -99.9990 think it 0.0000

    \3-grams:
    -99.9990 I disagree it
    -99.9990 I think it
    -99.9990 Sometimes I think
    -99.9990 good I disagree
    -99.9990 is good I
    -99.9990 it is good
    -99.9990 think it is

    \end\
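    To make the back-off arithmetic in the file's header concrete: for a word pair this tiny corpus never produced, say p(good | Sometimes), the bigram is missing, so the formula falls back to 10^(bo_wt_1(Sometimes) + p_1(wd2)). The choice of word pair here is just an illustration; the log10 values come from the 1-gram table above:

    ```shell
    # Back-off estimate for p(good | Sometimes): the bigram "Sometimes good"
    # is not in the model, so per the header formula the probability is
    # 10^(bo_wt_1(Sometimes) + p_1(good)), with bo_wt_1(Sometimes) = 0.2156
    # and p_1(good) = -1.0607 taken from the \1-grams section.
    awk 'BEGIN { printf "%.3f\n", 10^(0.2156 - 1.0607) }'
    # prints 0.143 (about 1/7, i.e. near-uniform over the seven vocabulary words)
    ```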

    Then I use "hello.lm" in place of "confidence.trigram.lm" in the confidence demo's configuration file.
    But I get the following errors and the application terminates.

    01:18.031 WARNING dictionary Missing word: <unk>
    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
    01:18.249 WARNING dictionary Missing word: <unk>
    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
    Exception in thread "main" java.lang.NullPointerException
    at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.getInitialSearchState(LexTreeLinguist.java:461)
    at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.compileGrammar(LexTreeLinguist.java:487)
    at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.allocate(LexTreeLinguist.java:406)
    at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.allocate(WordPruningBreadthFirstSearchManager.java:323)
    at edu.cmu.sphinx.decoder.Decoder.allocate(Decoder.java:109)
    at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:182)
    at Confidence.main(Confidence.java:60)
    Java Result: 1

    When I switch back to the original LM file, it works fine.
    I must have made some mistake generating the language model, but I just followed the instructions. Can someone help point out where I went wrong?
    Thanks in advance.

     
    • Yueyu/Lin

      Yueyu/Lin - 2008-09-08

      Thanks a lot!
      But I still have a problem when using the quick_lm Perl script.
      I always get a blank entry in the 1-grams, such as:

      \1-grams:
      -0.9542 -0.3010
      -1.2553 </s> -0.2499
      -1.2553 <s> -0.2499

      It doesn't matter much, since I can remove it manually, but is it a small bug in the script?

       
      • Nickolay V. Shmyrev

        I don't get it; are you sure you cleaned up the text properly? Probably something like a carriage return was left in.
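        One way to check for (and strip) stray carriage returns before feeding the text to quick_lm; "corpus.txt" is a placeholder file name:

        ```shell
        # Count lines containing a carriage return (non-zero means DOS line endings):
        grep -c "$(printf '\r')" corpus.txt
        # Strip the carriage returns and write a clean copy:
        tr -d '\r' < corpus.txt > corpus.clean.txt
        ```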

         
    • Nickolay V. Shmyrev

      > Then I use the CMU_Cam_Toolkit to generate my own language model using the following scripts:

      You are using a rather old package.

      > I must have made some mistake generating the language model, but I just followed the instructions. Can someone help point out where I went wrong?

      a) Your model has no <s> and </s> marks; they are required. You need to add <s> and </s> cues to the text before processing it.
      b) Your model has an <unk> tag; use -vocab_type 0 to generate a closed vocabulary.
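      A sketch of the fix for (a): wrap each line of the corpus in <s> … </s> before running the toolkit. Filenames follow the original post; the toolkit invocations are shown as comments since they depend on the CMU-Cam binaries being present:

      ```shell
      # Add sentence-start/end cues to every utterance:
      sed -e 's/^/<s> /' -e 's|$| </s>|' hello.transcript > hello.tagged
      # Then rebuild the model from the tagged text, asking idngram2lm for a
      # closed vocabulary (-vocab_type 0) so no <UNK> entry is produced:
      #   cat hello.tagged | ./text2wfreq.exe | ./wfreq2vocab.exe > hello.vocab
      #   cat hello.tagged | ./text2idngram.exe -n 3 -vocab hello.vocab | \
      #       ./idngram2lm.exe -n 3 -vocab hello.vocab -vocab_type 0 -idngram - -arpa hello.lm
      ```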

      There is no ready-to-use recipe to get a model right now; the easiest way is to use the quick_lm Perl script or the online lmtool. See also

      http://www.voxforge.org/home/forums/message-boards/speech-recognition-engines/cmuslm-_arpa?pn=2
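      Once a new model is built, one quick sanity check (a sketch, assuming the ARPA layout shown in the original post; "hello.lm" is the file name from there) is to confirm that each \N-grams: section really lists as many entries as the \data\ header declares:

      ```shell
      # Compare the ngram counts declared after \data\ with the number of
      # entries actually listed in each \N-grams: section of an ARPA file.
      awk '
        /^\\data\\/                  { started = 1; next }   # real header reached
        started && /^ngram [0-9]+=/  { split($2, kv, "="); want[kv[1] + 0] = kv[2] + 0; next }
        started && /^\\[0-9]-grams:/ { sec = substr($0, 2, 1) + 0; next }  # section start
        started && /^\\end\\/        { sec = 0 }             # stop counting
        sec && NF                    { got[sec]++ }          # one entry per line
        END { for (n in want) printf "%d-grams: declared=%d listed=%d\n", n, want[n] + 0, got[n] + 0 }
      ' hello.lm
      ```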

       
