I successfully built and ran the confidence demo.
Now I want to use my own transcriptions to build a language model.
My transcript file is named "hello.transcript". Its contents are:
Sometimes I think it is good
I disagree it
Then I used the CMU_Cam_Toolkit to generate my own language model with the following commands:
cat hello.transcript | ./text2wfreq.exe | ./wfreq2vocab.exe > hello.vocab
cat hello.transcript | ./text2idngram.exe -n 3 -vocab hello.vocab | ./idngram2lm.exe -absolute -n 3 -vocab hello.vocab -idngram - -arpa hello.lm
The contents of the generated "hello.lm" are:
Copyright (c) 1996, Carnegie Mellon University, Cambridge University,
Ronald Rosenfeld and Philip Clarkson
=============================================================================
=============== This file was produced by the CMU-Cambridge ===============
=============== Statistical Language Modeling Toolkit ===============
=============================================================================
This is a 3-gram language model, based on a vocabulary of 7 words,
which begins "I", "Sometimes", "disagree"...
This is an OPEN-vocabulary model (type 1)
(OOVs were mapped to UNK, which is treated as any other vocabulary word)
Absolute discounting was applied.
1-gram discounting constant : 0.714286
2-gram discounting constant : 1
3-gram discounting constant : 1
This file is in the ARPA-standard format introduced by Doug Paul.
p(wd3|wd1,wd2) = if(trigram exists)           p_3(wd1,wd2,wd3)
                 else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
                 else                         p(wd3|wd2)
p(wd2|wd1)     = if(bigram exists) p_2(wd1,wd2)
                 else              bo_wt_1(wd1)*p_1(wd2)
All probs and back-off weights (bo_wt) are given in log10 form.
Data formats:
Beginning of data mark: \data\
ngram 1=nr # number of 1-grams
ngram 2=nr # number of 2-grams
ngram 3=nr # number of 3-grams
\1-grams:
p_1 wd_1 bo_wt_1
\2-grams:
p_2 wd_1 wd_2 bo_wt_2
\3-grams:
p_3 wd_1 wd_2 wd_3
end of data mark: \end\
\data\
ngram 1=8
ngram 2=7
ngram 3=7
\1-grams:
-1.0607 <UNK> 0.0000
-0.4075 I 0.0830
-1.0607 Sometimes 0.2156
-1.0607 disagree 0.0000
-1.0607 good 0.2156
-1.0607 is 0.0395
-1.0607 it 0.0395
-1.0607 think 0.0395
\2-grams:
-99.9990 I disagree 0.0395
-99.9990 I think 0.0000
-99.9990 Sometimes I 0.0000
-99.9990 good I 0.0000
-99.9990 is good 0.0000
-99.9990 it is 0.0000
-99.9990 think it 0.0000
\3-grams:
-99.9990 I disagree it
-99.9990 I think it
-99.9990 Sometimes I think
-99.9990 good I disagree
-99.9990 is good I
-99.9990 it is good
-99.9990 think it is
\end\
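As the quoted header notes, all probabilities and back-off weights in the file are log10 values. A quick check (plain awk, nothing toolkit-specific) converts one of the entries above back to a plain probability:

```shell
# Convert an ARPA log10 probability (e.g. the -1.0607 unigram entries
# such as "Sometimes") back to a plain probability: p = 10^logp.
awk 'BEGIN { printf "%.4f\n", 10^(-1.0607) }'
# prints 0.0870
```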
Then I replaced "confidence.trigram.lm" with "hello.lm" in the confidence demo's configuration file.
But I get the following errors and the application terminates:
01:18.031 WARNING dictionary Missing word: <unk>
in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
01:18.249 WARNING dictionary Missing word: <unk>
in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
Exception in thread "main" java.lang.NullPointerException
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.getInitialSearchState(LexTreeLinguist.java:461)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.compileGrammar(LexTreeLinguist.java:487)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.allocate(LexTreeLinguist.java:406)
at edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager.allocate(WordPruningBreadthFirstSearchManager.java:323)
at edu.cmu.sphinx.decoder.Decoder.allocate(Decoder.java:109)
at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:182)
at Confidence.main(Confidence.java:60)
Java Result: 1
When I switch back to the original LM file, it works fine.
I must have made some mistake generating the language model, but I just followed the instructions. Can someone point out where I went wrong?
Thanks in advance.
Thanks a lot!
But I still have a problem when using the quick_lm Perl script.
I always get a blank entry in the 1-grams, such as:
\1-grams:
-0.9542 -0.3010
-1.2553 </s> -0.2499
-1.2553 <s> -0.2499
It doesn't matter much, since I can remove it manually, but is this a small bug in the script?
I don't get it; are you sure you cleaned up the text properly? Probably something like a carriage return was left in.
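For what it's worth, a stray Windows carriage return is easy to reproduce and strip. A minimal sketch (the CRLF transcript here is a stand-in, not the poster's actual file):

```shell
# A "\r" left at end-of-line glues onto the last token and can show up
# as a blank-looking word in the vocabulary. Stand-in CRLF transcript:
printf 'Sometimes I think it is good\r\nI disagree it\r\n' > hello.transcript
tr -d '\r' < hello.transcript > hello.clean   # strip every carriage return
cat hello.clean
```

Feeding the cleaned file to the toolkit (or to quick_lm) instead of the raw one should make the blank token disappear.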
> Then I used the CMU_Cam_Toolkit to generate my own language model with the following commands:
You are using a rather old package.
> I must have made some mistake generating the language model, but I just followed the instructions. Can someone point out where I went wrong?
a) Your model has no <s> and </s> markers; they are required. You need to add <s> and </s> around each sentence in the text before processing it.
b) Your model has an <UNK> tag; use -vocab_type 0 to generate a closed-vocabulary model.
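Adding the sentence markers from point (a) can be done with a one-line sed before running the toolkit. A minimal sketch (the transcript contents mirror the question; the file names are just examples):

```shell
# Wrap every transcript line in sentence start/end markers before
# feeding it to text2wfreq / text2idngram.
printf 'Sometimes I think it is good\nI disagree it\n' > hello.transcript
sed 's/^/<s> /; s/$/ <\/s>/' hello.transcript > hello.marked
cat hello.marked
```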
There is no ready-to-use recipe for building a model right now; the easiest way is to use the quick_lm Perl script or the online lmtool. See also:
http://www.voxforge.org/home/forums/message-boards/speech-recognition-engines/cmuslm-_arpa?pn=2